Dynamic control graphs for analysis of coordination-centric software designs

ABSTRACT

Static analysis can be of great benefit in debugging complex systems. Traditional runtime debugging is necessary because certain software errors cannot be detected until after they are compiled into execution errors. Static analysis can reduce the number of such errors and can aid designers by illuminating subtle design interactions. Disclosed are various systems and methods for static analysis that can be applied to coordination-centric systems, including typechecking, consistency checking, conflict detection through automatically derived abstract views, and model checking. The static analyses presented here comprise a form of preemptive debugging for coordination-centric software systems.

RELATED APPLICATIONS

[0001] This application is a continuation of U.S. Provisional Application No. 60/213,496 filed Jun. 23, 2000, incorporated herein by reference.

TECHNICAL FIELD

[0002] The present invention relates to static error checking of software systems designed for execution in a hardware architecture with multiple processing resources.

BACKGROUND OF THE INVENTION

[0003] A system design and programming methodology is most effective when it is closely integrated and coheres tightly with its corresponding debugging techniques. In distributed and embedded system methodologies, the relationship between debugging approaches and design methodologies has traditionally been one-sided in favor of the design and programming methodologies. Design and programming methodologies are typically developed without any consideration for the debugging techniques that will later be applied to software systems designed using that design and programming methodology. While these typical debugging approaches attempt to exploit features provided by the design and programming methodologies, the debugging techniques will normally have little or no impact on what the design and programming features are in the first place. This lack of input from debugging approaches to design and programming methodologies serves to maintain the role of debugging as an afterthought, even though in a typical system design, debugging consumes a majority of the design time. The need remains for a design and programming methodology that reflects input from, and consideration of, potential debugging approaches in order to enhance the design and reduce the implementation time of software systems.

[0004] 1. Packaging of Software Elements

[0005] Packaging refers to the set of interfaces a software element presents to other elements in a system. Software packaging has many forms in modern methodologies. Some examples are programming language procedure call interfaces (as with libraries), TCP/IP socket interfaces with scripting languages (as with mail and Web servers), and file formats. Several typical prior art packaging styles are described below, beginning with packaging techniques used in object-oriented programming languages and continuing with a description of more generalized approaches to packaging.

[0006] A. Object-Oriented Approaches to Packaging

[0007] One common packaging style is based on object-oriented programming languages and provides procedure-based (method-based) packaging for software elements (objects within this framework). These procedure-based packages allow polymorphism (in which several types of objects can have identical interfaces) through subtyping, and code sharing through inheritance (deriving a new class of objects from an already existing class of objects). In a typical object-oriented programming language, an object's interface is defined by the object's methods.

[0008] Object-oriented approaches are useful in designing concurrent systems (systems with task-level parallelism and multiple processing resources) because of the availability of active objects (objects with a thread of control). Common concurrent object-oriented approaches include actor languages and concurrent Eiffel.

[0009] Early object-oriented approaches featured anonymity of objects through dynamic typechecking. This anonymity of objects meant that a first object did not need to know anything about a second object in order to send a message to the second object. One unfortunate result of this anonymity was that the second object could unexpectedly respond to the first object that the sent message was not understood. This disrupted system execution and made systems designed with this object-oriented approach less predictable.

[0010] Most modern object-oriented approaches opt to sacrifice the benefits flowing from anonymity of objects in order to facilitate stronger static typing (checking to ensure that objects will properly communicate with one another before actually executing the software system). The main result of stronger static typing is improved system predictability. However, an unfortunate result of sacrificing the anonymity of objects is a tighter coupling between those objects, whereby each object must explicitly classify, and include knowledge about, other objects to which it sends messages. In modern object-oriented approaches the package (interface) has become indistinguishable from the object and the system of which the object is a part.

[0011] The need remains for a design and programming methodology that combines the benefits of anonymity for the software elements with the benefits derived from strong static typing of system designs.

[0012] B. Other Approaches to Packaging

[0013] Other packaging approaches provide higher degrees of separation between software elements and their respective packages than does the packaging in object-oriented systems. For example, the packages in event-based frameworks are interfaces with ports for transmitting and receiving events. These provide loose coupling for interelement communication. However, in an event-based framework, a software designer must explicitly implement interelement state coherence between software elements as communication between those software elements. This means that a programmer must perform the error-prone task of designing, optimizing, implementing, and debugging a specialized communication protocol for each state coherence requirement in a particular software system.

[0014] The common object request broker architecture (CORBA) provides an interface description language (IDL) for building packages around software elements written in a variety of languages. These packages are remote procedure call (RPC) based and provide no support for coordinating state between elements. With flexible packaging, an element's package is implemented as a set of co-routines that can be adapted for use with applications through use of adapters with interfaces complementary to the interface for the software element. These adapters can be application-specific, used only when the elements are composed into a system.

[0015] The use of co-routines lets a designer specify transactions or sequences of events as part of an interface, rather than just as atomic events. Unfortunately, co-routines must be executed in lock-step, meaning a transition in one co-routine corresponds to a transition in the other co-routine. If there is an error in one co-routine or if an expected event is lost, the interface will fail because its context will be incorrect to recover from the lost event and the co-routines will be out of sync.

[0016] The need remains for a design and programming methodology that provides software packaging that supports the implementation of state coherence in distributed concurrent systems without packaging or interface failure when an error or an unexpected event occurs.

[0017] 2. Approaches to Coordination

[0018] Coordination, within the context of this application, means the predetermined ways through which software components interact. In a broader sense, coordination refers to a methodology for composing concurrent components into a complete system. This use of the term coordination differs slightly from the use of the term in the parallelizing compiler literature, in which coordination refers to a technique for maintaining programwide semantics for a sequential program decomposed into parallel subprograms.

[0019] A. Coordination Languages

[0020] Coordination languages are usually a class of tuple-space programming languages, such as Linda. A tuple is a data object containing two or more types of data that are identified by their tags and parameter lists. In tuple-space languages, coordination occurs through the use of tuple spaces, which are global multisets of tagged tuples stored in shared memory. Tuple-space languages extend existing programming languages by adding six operators: out, in, read, eval, inp, and readp. The out, in, and read operators respectively place tuples in tuple space, fetch and remove tuples from it, and fetch tuples without removing them. Each of these three operators blocks until its operation is complete. The out operator creates tuples containing a tag and several arguments. Procedure calls can be included in the arguments, but since out blocks, the calls must be performed and the results stored in the tuple before the operator can return.

[0021] The operators eval, inp, and readp are nonblocking versions of out, in, and read, respectively. They increase the expressive power of tuple-space languages. Consider the case of eval, the nonblocking version of out. Instead of evaluating all arguments of the tuple before returning, it spawns a thread to evaluate them, creating, in effect, an active tuple (whereas tuples created by out are passive). As with out, when the computation is finished, the results are stored in a passive tuple and left in tuple space. Unlike out, however, the eval call returns immediately, so that several active tuples can be left outstanding.
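
The six operators described above can be illustrated with a minimal in-process sketch. This is not Linda itself: the class name TupleSpace, the use of None as a wildcard field, the method name in_ (since "in" is reserved in Python), and the three-argument form of eval are all assumptions made for illustration only.

```python
import threading


class TupleSpace:
    """Toy in-process tuple space; None in a pattern matches any field."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *fields):
        # Place a passive tuple in the space and wake any blocked readers.
        with self._cond:
            self._tuples.append(tuple(fields))
            self._cond.notify_all()

    def _find(self, pattern):
        for t in self._tuples:
            if len(t) == len(pattern) and all(
                p is None or p == f for p, f in zip(pattern, t)
            ):
                return t
        return None

    def _fetch(self, pattern, remove, block):
        with self._cond:
            while True:
                t = self._find(pattern)
                if t is not None:
                    if remove:
                        self._tuples.remove(t)
                    return t
                if not block:
                    return None  # nonblocking variants give up at once
                self._cond.wait()

    def in_(self, *pattern):  # blocking fetch and remove
        return self._fetch(pattern, remove=True, block=True)

    def read(self, *pattern):  # blocking fetch without removing
        return self._fetch(pattern, remove=False, block=True)

    def inp(self, *pattern):  # nonblocking fetch and remove
        return self._fetch(pattern, remove=True, block=False)

    def readp(self, *pattern):  # nonblocking fetch without removing
        return self._fetch(pattern, remove=False, block=False)

    def eval(self, tag, func, *args):
        # Active tuple: compute in a thread, then leave a passive tuple.
        worker = threading.Thread(target=lambda: self.out(tag, func(*args)))
        worker.start()
        return worker


space = TupleSpace()
space.out("point", 3, 4)
peeked = space.read("point", None, None)   # tuple stays in the space
taken = space.in_("point", None, None)     # tuple is removed
missing = space.inp("point", None, None)   # nothing left, returns None
space.eval("square", lambda x: x * x, 7)   # returns immediately
result = space.in_("square", None)         # blocks until the result arrives
```

In this sketch the call to eval returns before the squared value exists, and the later blocking in_ call is what waits for the passive result tuple.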

[0022] Tuple-space coordination can be used in concise implementations of many common interaction protocols. Unfortunately, tuple-space languages do not separate coordination issues from programming issues. Consider the annotated Linda implementation of RPC in Listing 1.

[0023] Listing 1: Linda used to emulate RPC

    rpcCall(args) {                                            /* C */
        out("RPCToServer", "Client", args...);
        in("Client", "ReturnFromServer", &returnValue);
        return returnValue;                                    /* C */
    }                                                          /* C */

    Server:
        ...
        while (true) {                                         /* C */
            in("RPCToServer", &returnAddress, args...);
            returnValue = functionCall(args);                  /* C */
            out(returnAddress, "ReturnFromServer", returnValue);
        }                                                      /* C */

[0024] Although the implementation depicted in Listing 1 is a compact representation of an RPC protocol, the implementation still depends heavily on an accompanying programming language (in this case, C). This dependency prevents designers from creating a new Linda RPC operator for arbitrary applications of RPC. Therefore, every time a designer uses Linda for RPC, they must copy the source code for RPC or make a C-macro. This causes tight coupling, because the client must know the name of the RPC server. If the server name is passed in as a parameter, flexibility increases; however, this requires a binding phase in which the name is obtained and applied outside of the Linda framework.

[0025] The need remains for a design and programming methodology that allows implementation of communication protocols without tight coupling between the protocol implementation and the software elements with which the protocol implementation works.

[0026] A tuple space can require large quantities of dynamically allocated memory. However, most systems, and especially embedded systems, must operate within predictable and sometimes small memory requirements. Tuple-space systems are usually not suitable for coordination in systems that must operate within small predictable memory requirements because once a tuple has been generated, it remains in tuple space until it is explicitly removed or the software element that created it terminates. Maintaining a global tuple space can be very expensive in terms of overall system performance. Although much work has gone into improving the efficiency of tuple-space languages, system performance remains worse with tuple-space languages than with message-passing techniques.

[0027] The need remains for a design and programming methodology that can effectively coordinate between software elements while respecting performance and predictable memory requirements.

[0028] B. Fixed Coordination Models

[0029] In tuple-space languages, much of the complexity of coordination remains entangled with the functionality of computational elements. An encapsulating coordination formalism decouples intercomponent interactions from the computational elements.

[0030] This type of formalism can be provided by fixed coordination models in which the coordination style is embodied in an entity and separated from computational concerns. Synchronous coordination models coordinate activity through relative schedules. Typically, these approaches require the coordination protocol to be manually constructed in advance. In addition, computational elements must be tailored to the coordination style used for a particular system (which may require intrusive modification of the software elements).

[0031] The need remains for a design and programming methodology that allows for coordination between software elements without tailoring the software elements to the specific coordination style used in a particular software system while allowing for interactions between software elements in a way that facilitates debugging complex systems.

SUMMARY OF THE INVENTION

[0032] Static analysis can be of great benefit in debugging complex systems. Traditional runtime debugging is necessary because certain software errors cannot be detected until after they are compiled into execution errors. Static analysis can reduce the number of such errors and can aid designers by illuminating subtle design interactions. The present invention relates to various types of static analysis that can be applied to coordination-centric software systems, including typechecking, consistency checking, conflict detection through automatically derived abstract views, and model checking. The static analyses presented here comprise a form of preemptive debugging.

[0033] Dynamic control graphs (DCGs) lend themselves to a wider variety of dynamic checks and to model checking as well. Model checking enables many more system checks, such as configuration reachability. To apply model checking to DCGs, it is first necessary to derive a transition relation and represent it as a binary decision diagram (BDD). The size of BDDs is sensitive to variable ordering, but the structure of DCGs enables ordering heuristics that typically result in BDDs of reasonable size.
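
The sensitivity of BDD size to variable ordering can be seen on a small example. The sketch below is illustrative only: the function robdd_size and the Boolean function pairs are assumptions chosen for the demonstration (a standard textbook function, not a transition relation derived from a DCG), and the node count is obtained by enumerating distinct cofactors of a truth table rather than by the apply algorithm.

```python
from itertools import product


def robdd_size(f, order):
    """Count internal nodes of the reduced ordered BDD of f under `order`.

    f takes a dict mapping variable names to booleans. The count is done by
    brute force over the truth table, so it only suits a handful of variables.
    """
    n = len(order)
    # Truth table with order[0] as the slowest-changing variable.
    table = tuple(
        f(dict(zip(order, bits))) for bits in product((False, True), repeat=n)
    )
    count = 0
    level = {table}  # distinct subfunctions still to be decided
    for _ in range(n):
        next_level = set()
        for t in level:
            half = len(t) // 2
            low, high = t[:half], t[half:]  # cofactors for False / True
            if low != high:
                count += 1  # a node is needed to test this variable
                next_level.update((low, high))
            else:
                next_level.add(low)  # redundant test, no node
        level = next_level
    return count


def pairs(v):
    # (a1 AND b1) OR (a2 AND b2) OR (a3 AND b3)
    return (v["a1"] and v["b1"]) or (v["a2"] and v["b2"]) or (v["a3"] and v["b3"])


interleaved = robdd_size(pairs, ["a1", "b1", "a2", "b2", "a3", "b3"])
separated = robdd_size(pairs, ["a1", "a2", "a3", "b1", "b2", "b3"])
```

Keeping related variables adjacent gives 6 internal nodes here, while separating them gives 14, which is why an ordering heuristic that follows the structure of the graph matters.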

[0034] Static analysis is used not only to find errors or bugs in a system, but to optimize a program as well. This is particularly evident in the case of control dataflow graphs (CDGs). Using these, efficient static schedules can be derived for modal dataflow regions with constant production and consumption rates.

[0035] Additional aspects and advantages of this invention will be apparent from the following detailed description of preferred embodiments thereof, which proceeds with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0036] FIG. 1 is a component in accordance with the present invention.

[0037] FIG. 2 is the component of FIG. 1 further having a set of coordination interfaces.

[0038] FIG. 3A is a prior art round-robin resource allocation protocol with a centralized controller.

[0039] FIG. 3B is a prior art round-robin resource allocation protocol implementing a token passing scheme.

[0040] FIG. 4A is a detailed view of a component and a coordination interface connected to the component for use in round-robin resource allocation in accordance with the present invention.

[0041] FIG. 4B depicts a round-robin coordinator in accordance with the present invention.

[0042] FIG. 5 shows several typical ports for use in a coordination interface in accordance with the present invention.

[0043] FIG. 6A is a unidirectional data transfer coordinator in accordance with the present invention.

[0044] FIG. 6B is a bidirectional data transfer coordinator in accordance with the present invention.

[0045] FIG. 6C is a state unification coordinator in accordance with the present invention.

[0046] FIG. 6D is a control state mutex coordinator in accordance with the present invention.

[0047] FIG. 7 is a system for implementing subsumption resource allocation having components, a shared resource, and a subsumption coordinator.

[0048] FIG. 8 is a barrier synchronization coordinator in accordance with the present invention.

[0049] FIG. 9 is a rendezvous coordinator in accordance with the present invention.

[0050] FIG. 10 depicts a dedicated RPC system having a client, a server, and a dedicated RPC coordinator coordinating the activities of the client and the server.

[0051] FIG. 11 is a compound coordinator with both preemption and round-robin coordination for controlling the access of a set of components to a shared resource.

[0052] FIG. 12A is a software system with two data transfer coordinators, each having constant message consumption and generation rules and each connected to a separate data-generating component and connected to the same data-receiving component.

[0053] FIG. 12B is the software system of FIG. 12A in which the two data transfer coordinators have been replaced with a merged data transfer coordinator.

[0054] FIG. 13 is a system implementing a first come, first served resource allocation protocol in accordance with the present invention.

[0055] FIG. 14 is a system implementing a multiclient RPC coordination protocol formed by combining the first come, first served protocol of FIG. 13 with the dedicated RPC coordinator of FIG. 10.

[0056] FIG. 15 depicts a large system in which the coordination-centric design methodology can be employed having a wireless device interacting with a cellular network.

[0057] FIG. 16 shows a top-level view of the behavior and components for a system for a cell phone.

[0058] FIG. 17A is a detailed view of a GUI component of the cell phone of FIG. 16.

[0059] FIG. 17B is a detailed view of a call log component of the cell phone of FIG. 16.

[0060] FIG. 18A is a detailed view of a voice subsystem component of the cell phone of FIG. 16.

[0061] FIG. 18B is a detailed view of a connection component of the cell phone of FIG. 16.

[0062] FIG. 19 depicts the coordination layers between a wireless device and a base station, and between the base station and a switching center, of FIG. 15.

[0063] FIG. 20 depicts a cell phone call management component, a master switching center call management component, and a call management coordinator connecting the respective call management components.

[0064] FIG. 21A is a detailed view of a transport component of the connection component of FIG. 18B.

[0065] FIG. 21B is a CDMA data modulator of the transport component of FIG. 18B.

[0066] FIG. 22 is a detailed view of a typical TDMA and a typical CDMA signal for the cell phone of FIG. 16.

[0067] FIG. 23A is an LCD touch screen component for a Web browser GUI for a wireless device.

[0068] FIG. 23B is a Web page formatter component for the Web browser GUI for the wireless device.

[0069] FIG. 24A is a completed GUI system for a handheld Web browser.

[0070] FIG. 24B shows the GUI system for the handheld Web browser combined with the connection subsystem of FIG. 18B in order to access the cellular network of FIG. 15.

[0071] FIG. 25 is a typical space/time diagram with space represented on a vertical axis and time represented on a horizontal axis.

[0072] FIG. 26 is a space/time diagram depicting a set of system events and two different observations of those system events.

[0073] FIG. 27 is a space/time diagram depicting a set of system events and an ideal observation of the events taken by a real-time observer.

[0074] FIG. 28 is a space/time diagram depicting two different yet valid observations of a system execution.

[0075] FIG. 29 is a space/time diagram depicting a system execution and an observation of that execution taken by a discrete Lamport observer.

[0076] FIG. 30 is a space/time diagram depicting a set of events that each include a Lamport timestamp.

[0077] FIG. 31 is a space/time diagram illustrating the insufficiency of scalar timestamps to characterize causality between events.

[0078] FIG. 32 is a space/time diagram depicting a set of system events that each include a vector timestamp.

[0079] FIG. 33 depicts a display from a Partial Order Event Tracer (POET).

[0080] FIG. 34 is a space/time diagram depicting two compound events that are neither causal nor concurrent.

[0081] FIG. 35 is a POET display of two convex event clusters.

[0082] FIG. 36 is a basis for distributed event environments (BEE) abstraction facility for a single client.

[0083] FIG. 37 is a hierarchical tree construction of process clusters.

[0084] FIG. 38A depicts a qualitative measure of cohesion and coupling between a set of process clusters that have heavy communication or are instantiated from the same source code.

[0085] FIG. 38B depicts a qualitative measure of cohesion and coupling between a set of process clusters that do not have heavy communication or are not instances of the same source code.

[0086] FIG. 38C depicts a qualitative measure of cohesion and coupling between an alternative set of process clusters that have heavy communication or are instantiated from the same source code.

[0087] FIG. 39 depicts a consistent and an inconsistent cut of a system execution on a space/time diagram.

[0088] FIG. 40A is a space/time diagram depicting a system execution.

[0089] FIG. 40B is a lattice representing all possible consistent cuts of the space/time diagram of FIG. 40A.

[0090] FIG. 40C is a graphical representation of the possible consistent cuts of FIG. 40B.

[0091] FIG. 41A is a space/time diagram depicting a system execution.

[0092] FIG. 41B is the space/time diagram of FIG. 41A after performing a global-step.

[0093] FIG. 41C is the space/time diagram of FIG. 41A after performing a step-over.

[0094] FIG. 41D is the space/time diagram of FIG. 41A after performing a step-in.

[0095] FIG. 42 is a space/time diagram depicting a system that is subject to a domino effect whenever the system is rolled back in time to a checkpoint.

[0096] FIG. 43 depicts a simple static control graph in accordance with the present invention.

[0097] FIG. 44A is an edge that asserts the value true at its head and is responsive to the value true at its tail.

[0098] FIG. 44B is an edge that asserts the value true at its head and is responsive to the value false at its tail.

[0099] FIG. 44C is an edge that asserts the value false at its head and is responsive to the value true at its tail.

[0100] FIG. 44D is an edge that asserts the value false at its head and is responsive to the value false at its tail.

[0101] FIG. 45 is an illustration of the semantic differences between Boolean networks and static control graphs.

[0102] FIG. 46A shows a basic control graph with reduced characteristic functions.

[0103] FIG. 46B shows a basic control graph with reduced characteristic functions.

[0104] FIG. 47A shows the impact of edge semantics on SCG semantics when an enforcing edge asserts a value of false at its head.

[0105] FIG. 47B shows the impact of edge semantics on SCG semantics when an enforcing edge is responsive to a value of false at its tail.

[0106] FIG. 47C shows the impact of edge semantics on SCG semantics when a sensing edge asserts a value of false at its head.

[0107] FIG. 47D shows the impact of edge semantics on SCG semantics when a sensing edge is responsive to a value of false at its tail.

[0108] FIG. 48A depicts a coordinator for a rendezvous-style coordination.

[0109] FIG. 48B is the static control graph that represents a coordinator for a rendezvous-style coordination.

[0110] FIG. 49 is a static control graph with no stable and non-conflicting states.

[0111] FIG. 50A is the static control graph that represents the reduction from 3-SAT to Any Stable State Property.

[0112] FIG. 50B is a logic representation for the static control graph that represents the reduction from 3-SAT to Any Stable State Property.

[0113] FIG. 50C is an equivalent logic representation for the static control graph that represents the reduction from 3-SAT to Any Stable State Property.

[0114] FIG. 51A is a static control graph showing a first order conflict.

[0115] FIG. 51B is a static control graph showing a second order conflict.

[0116] FIG. 52A is a static control graph showing a disjunctive node (d_(i)) before “flattening.”

[0117] FIG. 52B is the static control graph showing the disjunctive node (d_(i)) after “flattening.”

[0118] FIG. 53 is a static control graph that shows a potential conflict, highlighted by “flattening.”

[0119] FIG. 54 is the static control graph for FIG. 49, after “flattening.”

[0120] FIG. 55A shows a component's inner static control graph.

[0121] FIG. 55B shows the component's static control graph after the internal nodes are summarized as a single, independent node.

[0122] FIG. 56 shows an ideal structure for applying hierarchical reduction to highlight instability in a system.

[0123] FIG. 57 depicts a dynamic control graph (DCG).

[0124] FIG. 58 depicts a DCG with an action node.

[0125] FIG. 59 shows a DCG with an action node for an action that is transparent with respect to control interactions.

[0126] FIG. 60 shows a DCG with an action node for an action that is opaque with respect to control interactions.

[0127] FIG. 61A shows a DCG for a rendezvous coordinator.

[0128] FIG. 61B shows a DCG for a rendezvous coordinator with two participant preemption.

[0129] FIG. 62A shows a communication channel between partitions of a static control graph (SCG) that can cause an action-only barrier.

[0130] FIG. 62B shows a DCG corresponding to the SCG of FIG. 62A after partitioning across the action-only barrier.

[0131] FIG. 63 depicts constraint edges that cross the action-only barrier, from FIG. 62A, and their corresponding templates.

[0132] FIG. 64 depicts a current DCG along with a next DCG, which is the result of temporally unrolling the current DCG.

[0133] FIG. 65A depicts a simple DCG.

[0134] FIG. 65B depicts an unrolled DCG for the simple DCG of FIG. 65A.

[0135] FIG. 66A shows a truth table for a Boolean “and” function.

[0136] FIG. 66B shows a truth tree that corresponds to the truth table in FIG. 66A.

[0137] FIG. 67A shows a reduced binary decision diagram (BDD) for the truth tree in FIG. 66B.

[0138] FIG. 67B shows an alternate reduced BDD for the truth tree in FIG. 66B.

[0139] FIG. 67C shows a third alternate reduced BDD for the truth tree in FIG. 66B.

[0140] FIG. 68 shows the results of using the apply algorithm to grow a BDD, which represents the characteristic function of the unrolled DCG from FIG. 65B.

[0141] FIG. 69A shows an unrolled DCG for the rendezvous DCG of FIG. 61A.

[0142] FIG. 69B shows that a critical factor in ordering for the characteristic function of the unrolled DCG of FIG. 69A is the relative ordering of wait_(c), wait_(b), and wait_(a).

[0143] FIGS. 70A, B, C, and D show that interleaving asynchronous models is not a sufficient static error-checking tool for coordination-centric system designs.

[0144] FIG. 71 shows a control/dataflow graph (CDG).

[0145] FIG. 72 shows a CDG representation of an RPC system.

[0146] FIG. 73A shows the second step in partitioning a CDG across an action-only barrier.

[0147] FIG. 73B shows the DCG from FIG. 62B, with the action-only barrier transformed to a message-only barrier.

[0148] FIG. 74A shows a CDG with a set of message rate guarantees.

[0149] FIG. 74B shows a dataflow graph based on the CDG of FIG. 74A.

[0150] FIG. 75 shows a dataflow graph.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENT

[0151] Coordination-Centric Software Design

[0152] FIG. 1 is an example of a component 100, which is the basic software element within the coordination-centric design framework, in accordance with the present invention. With reference to FIG. 1, component 100 contains a set of modes 102. Each mode 102 corresponds to a specific behavior associated with component 100. Each mode 102 can either be active or inactive, respectively enabling or disabling the behavior corresponding to that mode 102. Modes 102 can make the conditional aspects of the behavior of component 100 explicit. The behavior of component 100 is encapsulated in a set of actions 104, which are discrete, event-triggered behavioral elements within the coordination-centric design methodology. Component 100 can be copied and the copies of component 100 can be modified, providing the code-sharing benefits of inheritance.

[0153] Actions 104 are enabled and disabled by modes 102, and hence can be thought of as effectively being properties of modes 102. An event (not shown) is an instantaneous condition, such as a timer tick, a data departure or arrival, or a mode change. Actions 104 can activate and deactivate modes 102, thereby selecting the future behavior of component 100. This is similar to actor languages, in which methods are allowed to replace an object's behavior.

[0154] In coordination-centric design, however, all possible behaviors must be identified and encapsulated before runtime. For example, a designer building a user interface component for a cell phone might define one mode for looking up numbers in an address book (in which the user interface behavior is to display complete address book entries in formatted text) and another mode for displaying the status of the phone (in which the user interface behavior is to graphically display the signal power and the battery levels of the phone). The designer must define both the modes and the actions for the given behaviors well before the component can be executed.
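
The relationship between modes, actions, and events in the cell phone example can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the class Component, its methods, the event names "display" and "status_key", and the sample data are all hypothetical.

```python
class Component:
    """Toy component: modes enable actions, and actions may switch modes."""

    def __init__(self, modes):
        self.modes = {name: False for name in modes}  # mode name -> active?
        self.actions = []  # (mode, event, handler), all declared before use

    def add_action(self, mode, event, handler):
        if mode not in self.modes:
            raise KeyError(f"undeclared mode: {mode}")
        self.actions.append((mode, event, handler))

    def activate(self, mode):
        self.modes[mode] = True

    def deactivate(self, mode):
        self.modes[mode] = False

    def dispatch(self, event, payload=None):
        # Snapshot the enabled handlers first, so a mode change made by one
        # handler only affects later events.
        enabled = [h for m, e, h in self.actions if e == event and self.modes[m]]
        return [h(self, payload) for h in enabled]


def show_entry(ui, entry):
    return f"{entry['name']}: {entry['number']}"


def show_status(ui, status):
    return f"signal={status['signal']} battery={status['battery']}%"


def switch_to_status(ui, _payload):
    # An action selecting future behavior by changing which modes are active.
    ui.deactivate("address_book")
    ui.activate("phone_status")
    return "switched"


ui = Component(["address_book", "phone_status"])
ui.add_action("address_book", "display", show_entry)
ui.add_action("phone_status", "display", show_status)
ui.add_action("address_book", "status_key", switch_to_status)

ui.activate("address_book")
first = ui.dispatch("display", {"name": "Alice", "number": "555-0100"})
ui.dispatch("status_key")
second = ui.dispatch("display", {"signal": 3, "battery": 80})
```

The same "display" event produces different behavior before and after the mode change, and every mode and action is registered before the first event is dispatched.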

[0155] FIG. 2 is component 100 further including a first coordination interface 200, a second coordination interface 202, and a third coordination interface 204. Coordination-centric design's components 100 provide the code-sharing capability of object-oriented inheritance through copying. Another aspect of object-oriented inheritance is polymorphism through shared interfaces. In object-oriented languages, an object's interface is defined by its methods. Although coordination-centric design's actions 104 are similar to methods in object-oriented languages, they do not define the interface for component 100. Components interact through explicit and separate coordination interfaces, in this figure coordination interfaces 200, 202, and 204. The shape of coordination interfaces 200, 202, and 204 determines the ways in which component 100 may be connected within a software system. The way coordination interfaces 200, 202, and 204 are connected to modes 102 and actions 104 within component 100 determines how the behavior of component 100 can be managed within a system. Systemwide behavior is managed through coordinators (see FIG. 4B and subsequent).

[0156] For our approach to be effective, several factors in the design of software elements must coincide: packaging, internal organization, and how elements coordinate their behavior. Although these are often treated as independent issues, conflicts among them can exacerbate debugging. We handle them in a unified framework that separates the internal activity from the external relationship of component 100. This lets designers build more modular components and encourages them to specify distributable versions of coordination protocols. Components can be reused in a variety of contexts, both distributed and single-processor.

[0157] 1. Introduction to Coordination

[0158] Within this application, coordination refers to the predetermined ways by which components interact. Consider a common coordination activity: resource allocation. One simple protocol for this is round-robin: participants are lined up, and the resource is given to each participant in turn. After the last participant is served, the resource is given back to the first. There is a resource-scheduling period during which each participant gets the resource exactly once, whether or not it is needed.
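
The round-robin order just described can be sketched in a few lines. The function name grants and the participant labels are assumptions for illustration; the sketch only shows the order of grants, not how the resource itself is handed over.

```python
from itertools import cycle, islice


def grants(participants, count):
    """Return the first `count` resource grants in round-robin order."""
    return list(islice(cycle(participants), count))


participants = ["a", "b", "c"]
schedule = grants(participants, 7)

# One resource-scheduling period: every participant appears exactly once,
# whether or not it needs the resource.
period = schedule[: len(participants)]
```

After the last participant in a period is served, the schedule wraps around to the first participant again.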

[0159] FIG. 3A is a prior art round-robin resource allocation protocol with a centralized controller 300, which keeps track of and distributes the shared resource (not shown) to each of software elements 302, 304, 306, 308, and 310 in turn. With reference to FIG. 3A, controller 300 alone determines which software element 302, 304, 306, 308, or 310 is currently allowed to use the resource and which has it next. This implementation of a round-robin protocol permits software elements 302, 304, 306, 308, and 310 to be modular, because only controller 300 keeps track of the software elements. Unfortunately, when this protocol is implemented on a distributed architecture (not shown), controller 300 must typically be placed on a single processing element (not shown). As a result, all coordination requests must go through that processing element, which can cause a communication performance bottleneck. For example, consider the situation in which software elements 304 and 306 are implemented on a first processing element (not shown) and controller 300 is implemented on a second processing element. Software element 304 releases the shared resource and must send a message indicating this to controller 300. Controller 300 must then send a message to software element 306 to inform software element 306 that it now has the right to the shared resource. If the communication channel between the first processing element and the second processing element is in use or the second processing element is busy, then the shared resource must remain idle, even though both the current resource holder and the next resource holder (software elements 304 and 306 respectively) are implemented on the first processing element. The shared resource must typically remain idle until communication can take place and controller 300 can respond. This is an inefficient way to control access to a shared resource.

[0160] FIG. 3B is a prior art round-robin resource allocation protocol implementing a token passing scheme. With reference to FIG. 3B, this system consists of a shared resource 311 and a set of software elements 312, 314, 316, 318, 320, and 322. In this system a logical token 324 symbolizes the right to access resource 311, i.e., when a software element holds token 324, it has the right to access resource 311. When one of software elements 312, 314, 316, 318, 320, or 322 finishes with resource 311, it passes token 324, and with token 324 the access right, to a successor. This implementation can be distributed without a centralized controller, but as shown in FIG. 3B, this is less modular, because it requires each software element in the set to keep track of a successor.

[0161] Not only must software elements 312, 314, 316, 318, 320, and 322 keep track of successors, but each must implement a potentially complicated and error-prone protocol for transferring token 324 to its successor. Bugs can cause token 324 to be lost or introduce multiple tokens 324. Since there is no formal connection between the physical system and complete topology maps (diagrams that show how each software element is connected to others within the system), some software elements might erroneously be serviced more than once per cycle, while others are completely neglected. However, these bugs can be extremely difficult to track after the system is completed. The protocol is entangled with the functionality of each software element, and it is difficult to separate the two for debugging purposes. Furthermore, if a few of the software elements are located on the same machine, performance of the implementation can be poor. The entangling of computation and coordination requires intrusive modification to optimize the system.

[0162] 2. Coordination-Centric Design's Approach to Coordination

[0163] The coordination-centric design methodology provides an encapsulating formalism for coordination. Components such as component 100 interact using coordination interfaces, such as first, second, and third coordination interfaces 200, 202, and 204, respectively. Coordination interfaces preserve component modularity while exposing any parts of a component that participate in coordination. This technique of connecting components provides polymorphism in a similar fashion to subtyping in object-oriented languages.

[0164] FIG. 4A is a detailed view of a component 400 and a resource access coordination interface 402 connected to component 400 for use in a round-robin coordination protocol in accordance with the present invention. With reference to FIG. 4A, resource access coordination interface 402 facilitates implementation of a round-robin protocol that is similar to the token-passing round-robin protocol described above. Resource access coordination interface 402 has a single bit of control state, called access, which is shown as an arbitrated control port 404 that indicates whether or not component 400 is holding a virtual token (not shown). Component 400 can only use a send message port 406 on access coordination interface 402 when arbitrated control port 404 is true. Access coordination interface 402 further has a receive message port 408.

[0165] FIG. 4B shows a round-robin coordinator 410 in accordance with the present invention. With reference to FIG. 4B, round-robin coordinator 410 has a set of coordinator coordination interfaces 412 for connecting to a set of components 400. Each component 400 includes a resource access coordination interface 402. Each coordinator coordination interface 412 has a coordinator arbitrated control port 414, an incoming send message port 416, and an outgoing receive message port 418. Coordinator coordination interface 412 is complementary to resource access coordination interface 402, and vice versa, because the ports on the two interfaces are compatible and can function to transfer information between the two interfaces.

[0166] The round-robin protocol requires round-robin coordinator 410 to manage the coordination topology. Round-robin coordinator 410 is an instance of more general abstractions called coordination classes, in which coordination classes define specific coordination protocols and a coordinator is a specific implementation of the coordination class. Round-robin coordinator 410 contains all information about how components 400 are supposed to coordinate. Although round-robin coordinator 410 can have a distributed implementation, no component 400 is required to keep references to any other component 400 (unlike the distributed round-robin implementation shown in FIG. 3B). All required references are maintained by round-robin coordinator 410 itself, and components 400 do not even need to know that they are coordinating through round-robin. Resource access coordination interface 402 can be used with any coordinator that provides the appropriate complementary interface. A coordinator's design is independent of whether it is implemented on a distributed platform or on a monolithic single processor platform.

[0167] 3. Coordination Interfaces

[0168] Coordination interfaces are used to connect components to coordinators. They are also the principal key to a variety of useful runtime debugging techniques. Coordination interfaces support component modularity by exposing all parts of the component that participate in the coordination protocol. Ports are elements of coordination interfaces, as are guarantees and requirements, each of which will be described in turn.

[0169] A. Ports

[0170] A port is a primitive connection point for interconnecting components. Each port is a five-tuple (T; A; Q; D; R) in which:

[0171] T represents the data type of the port. T can be one of int, boolean, char, byte, float, double, or cluster, in which cluster represents a cluster of data types (e.g., an int followed by a float followed by two bytes).

[0172] A is a boolean value that is true if the port is arbitrated and false otherwise.

[0173] Q is an integer greater than zero that represents logical queue depth for a port.

[0174] D is one of in, out, inout, or custom and represents the direction data flows with respect to the port.

[0175] R is one of discard-on-read, discard-on-transfer, or hold and represents the policy for data removal on the port. Discard-on-read indicates that data is removed immediately after it is read (and any data in the logical queue are shifted), discard-on-transfer indicates that data is removed from a port immediately after being transferred to another port, and hold indicates that data should be held until it is overwritten by another value.

[0176] Hold is subject to arbitration.
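By way of illustration only, the five-tuple (T; A; Q; D; R) can be sketched as a small validated record. The class name `Port`, its field names, and the string spellings of the permitted values are assumptions made for this sketch and are not part of the disclosure.

```python
from dataclasses import dataclass

# Permitted values taken from the definitions of T, D, and R above.
DATA_TYPES = {"int", "boolean", "char", "byte", "float", "double", "cluster"}
DIRECTIONS = {"in", "out", "inout", "custom"}
REMOVAL_POLICIES = {"discard-on-read", "discard-on-transfer", "hold"}


@dataclass(frozen=True)
class Port:
    """Hypothetical record mirroring the five-tuple (T; A; Q; D; R)."""
    data_type: str     # T
    arbitrated: bool   # A
    queue_depth: int   # Q, must be greater than zero
    direction: str     # D
    removal: str       # R

    def __post_init__(self):
        if self.data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {self.data_type}")
        if self.queue_depth <= 0:
            raise ValueError("queue depth must be greater than zero")
        if self.direction not in DIRECTIONS:
            raise ValueError(f"unknown direction: {self.direction}")
        if self.removal not in REMOVAL_POLICIES:
            raise ValueError(f"unknown removal policy: {self.removal}")


# An arbitrated boolean port that holds its value, e.g. an access bit.
access_port = Port("boolean", True, 1, "inout", "hold")
```

Rejecting a queue depth of zero at construction time reflects the requirement that Q be an integer greater than zero.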

[0177] Custom directionality allows designers to specify ports that accept or generate only certain specific values. For example, a designer may want a port that allows other components to activate, but not deactivate, a mode. While many combinations of port attributes are possible, we normally encounter only a few. The three most common are message ports (output or input), state ports (output, input, or both; sometimes arbitrated), and control ports (a type of state port). FIG. 5 illustrates the visual syntax used for several common ports throughout this application. With reference to FIG. 5, this figure depicts an exported state port 502, an imported state port 504, an arbitrated state port 506, an output data port 508, and an input data port 510.

[0178] 1. Message Ports

[0179] Message ports (output and input data ports 508 and 510, respectively) are either send (T; false; 1; out; discard-on-transfer) or receive (T; false; Q; in; discard-on-read). Their function is to transfer data between components. Data passed to a send port is transferred immediately to the corresponding receive port; thus it cannot be retrieved from the send port later. Receive data ports can have queues of various depths. Data arrivals on these ports are frequently used to trigger and pass data parameters into actions. Values remain on receive ports until they are read.

[0180] 2. State Ports

[0181] State ports take one of three forms:

[0182] 1. (T; false; 1; out; hold)

[0183] 2. (T; false; 1; in; hold)

[0184] 3. (T; true; 1; inout; hold)

[0185] State ports, such as exported state port 502, imported state port 504, and arbitrated state port 506, hold persistent values, and the value assigned to a state port may be arbitrated. This means that, unlike message ports, values remain on the state ports until changed. When multiple software elements simultaneously attempt to alter the value of arbitrated state port 506, the final value is determined based on arbitration rules provided by the designer through an arbitration coordinator (not shown).

[0186] State ports transfer variable values between scopes, as explained below. In coordination-centric design, all variables referenced by a component are local to that component, and these variables must be explicitly declared in the component's scope. Variables can, however, be bound to state ports that are connected to other components. In this way a variable value can be transferred between components and the variable value achieves the system-level effect of a multivariable.

[0187] 3. Control Ports

[0188] Control ports are similar to state ports, but a control port is limited to having the boolean data type. Control ports are typically bound to modes. Actions interact with a control port indirectly, by setting and responding to the values of a mode that is bound to the control port.

[0189] For example, arbitrated control port 404 shown in FIG. 4A is a control port that can be bound to a mode (not shown) containing all actions that send data on a shared channel. When arbitrated control port 404 is false, the mode is inactive, disabling all actions that send data on the channel.

[0190] B. Guarantees

[0191] Guarantees are formal declarations of invariant properties of a coordination interface. There can be several types of guarantees, such as timing guarantees between events, guarantees between control states (e.g., state A and state B are guaranteed to be mutually exclusive), etc. Although a coordination interface's guarantees reflect properties of the component to which the coordination interface is connected, the guarantees are not physically bound to any internal portions of the component. Guarantees can often be certified through static analysis of the software system. Guarantees are meant to cache various properties that are inherent in a component or a coordinator in order to simplify static analysis of the software system.

[0192] A guarantee is a promise provided by a coordination interface. The guarantee takes the form of a predicate promised to be invariant. In principle, guarantees can include any type of predicate (e.g., x>3, in which x is an integer valued state port, or t_(ea)−t_(eb)<2 ms). Throughout the remainder of this application, guarantees will be only event-ordering guarantees (guarantees that specify acceptable orders of events) or control-relationship guarantees (guarantees pertaining to acceptable relative component behaviors).

[0193] C. Requirements

[0194] A requirement is a formal declaration of the properties necessary for correct software system functionality. An example of a requirement is a required response time for a coordination interface, or the number of messages that must have arrived at the coordination interface before the coordination interface can transmit, or fire, the messages. When two coordination interfaces are bound together, the requirements of the first coordination interface must be conservatively matched by the guarantees of the second coordination interface (e.g., x<7 as a guarantee conservatively matches x<8 as a requirement). As with guarantees, requirements are not physically bound to anything within the component itself. Guarantees can often be verified to be sufficient for the correct operation of the software system in which the component is used. In sum, a requirement is a predicate on a first coordination interface that must be conservatively matched with a guarantee on a complementary second coordination interface.
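By way of illustration only, conservative matching can be sketched for the narrow case in which both the guarantee and the requirement are upper-bound predicates of the form x<bound. The names `UpperBound` and `conservatively_matches` are hypothetical, and restricting predicates to this one form is an assumption of the sketch; the disclosure permits other predicate types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpperBound:
    """Predicate of the form `name < bound` (a deliberately narrow, assumed form)."""
    name: str
    bound: float


def conservatively_matches(guarantee: UpperBound, requirement: UpperBound) -> bool:
    """True if every value allowed by the guarantee also satisfies the requirement."""
    return guarantee.name == requirement.name and guarantee.bound <= requirement.bound


# The guarantee x<7 conservatively matches the requirement x<8 ...
print(conservatively_matches(UpperBound("x", 7), UpperBound("x", 8)))  # True
# ... but the weaker guarantee x<9 does not.
print(conservatively_matches(UpperBound("x", 9), UpperBound("x", 8)))  # False
```

The check amounts to an implication test: the guarantee must imply the requirement, which for upper bounds reduces to comparing the two bounds.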

[0195] D. Conclusion Regarding Coordination Interfaces

[0196] A coordination interface is a four-tuple (P; G; R; I) in which:

[0197] P is a set of named ports.

[0198] G is a set of named guarantees provided by the interface.

[0199] R is a set of named requirements that must be matched by guarantees of connected interfaces.

[0200] I is a set of named coordination interfaces.

[0201] As this definition shows, coordination interfaces are recursive. Coordinator coordination interface 412, shown in FIG. 4B, used for round-robin coordination is called AccessInterface and is defined in Table 1.

TABLE 1
Constituent     Value
ports           P = { access:StatePort, s:outMessagePort, r:inMessagePort }
guarantees      G = { ¬access ⇒ ¬s.gen }
requirements    R = Ø
interfaces      I = Ø

[0202] Related to coordination interfaces is a recursive coordination interface descriptor, which is a five-tuple (P_(a); G_(a); R_(a); I_(d); N_(d)) in which:

[0203] P_(a) is a set of abstract ports, which are ports that may be incomplete in their attributes (i.e., they do not yet have a datatype).

[0204] G_(a) is a set of abstract guarantees, which are guarantees between abstract ports.

[0205] R_(a) is a set of abstract requirements, which are requirements between abstract ports.

[0206] I_(d) is a set of coordination interface descriptors.

[0207] N_(d) is an element of Q×Q, where Q={∞}∪Z+ and Z+ denotes the set of positive integers. N_(d) indicates the number or range of numbers of permissible interfaces.

[0208] Allowing coordination interfaces to contain other coordination interfaces is a powerful feature. It lets designers use common coordination interfaces as complex ports within other coordination interfaces. For example, the basic message ports described above are nonblocking, but we can build a blocking coordination interface (not shown) that serves as a blocking port by combining a wait state port with a message port.

[0209] 4. Coordinators

[0210] A coordinator provides the concrete representations of intercomponent aspects of a coordination protocol. Coordinators allow a variety of static analysis debugging methodologies for software systems created with the coordination-centric design methodology. A coordinator contains a set of coordination interfaces and defines the relationships among the coordination interfaces. The coordination interfaces complement the component coordination interfaces provided by components operating within the protocol. Through matched interface pairs, coordinators effectively describe connections between message ports, correlations between control states, and transactions between components.

[0211] For example, round-robin coordinator 410, shown in FIG. 4B, must ensure that only one component 400 has its component control port 404's value, or its access bit, set to true. Round-robin coordinator 410 must further ensure that the correct component 400 has its component control port 404 set to true for the chosen sequence. This section presents formal definitions of the parts that comprise coordinators: modes, actions, bindings, action triples, and constraints. These definitions culminate in a formal definition of coordinators.

[0212] A. Modes

[0213] A mode is a boolean value that can be used as a guard on an action. In a coordinator, the mode is most often bound to a control port in a coordination interface for the coordinator. For example, in round-robin coordinator 410, the modes of concern are bound to a coordinator control port 414 of each coordinator coordination interface 412.

[0214] B. Actions

[0215] An action is a primitive behavioral element that can:

[0216] Respond to events.

[0217] Generate events.

[0218] Change modes.

[0219] Actions can range in complexity from simple operations up to complicated pieces of source code. An action in a coordinator is called a transparent action because the effects of the action can be precomputed and the internals of the action are completely exposed to the coordination-centric design tools.

[0220] C. Bindings

[0221] Bindings connect input ports to output ports, control ports to modes, state ports to variables, and message ports to events. Bindings are transparent and passive. Bindings are simply conduits for event notification and data transfer. When used for event notification, bindings are called triggers.

[0222] D. Action Triples

[0223] To be executed, an action must be enabled by a mode and triggered by an event. The combination of a mode, trigger, and action is referred to as an action triple, which is a triple (m; t; a) in which:

[0224] m is a mode.

[0225] t is a trigger.

[0226] a is an action.

[0227] The trigger is a reference to an event type, but it can be used to pass data into the action. Action triples are written: mode : trigger : action

[0228] A coordinator's actions are usually either pure control, in which both the trigger and action performed affect only control state, or pure data, in which both the trigger and action performed occur in the data domain. In the case of round-robin coordinator 410, the following set of actions is responsible for maintaining the appropriate state:

$\mathit{access}_i : -\mathit{access}_i : +\mathit{access}_{(i+1) \bmod n}$

[0229] The symbol “+” signifies a mode's activation edge (i.e., the event associated with the mode becoming true), and the symbol “−” signifies its deactivation edge. When any coordinator coordination interface 412 deactivates its arbitrated control port 404's access bit, the access bit of the next coordinator coordination interface 412 is automatically activated.
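By way of illustration only, the net effect of the round-robin action triples can be sketched as a function over a list of access modes. The function name `release` and the list representation are assumptions of this sketch; event delivery and edge detection are not modeled, and the function simply applies the deactivation of mode i followed by the activation of its successor.

```python
def release(access, i):
    """Apply the round-robin action for interface i to a list of access modes.

    `access` holds one boolean mode per coordinator interface. The action is
    guarded by mode i and represents its deactivation edge followed by the
    activation edge of mode (i + 1) mod n.
    """
    n = len(access)
    if not access[i]:            # guard: mode i must be active to release
        return list(access)
    updated = list(access)
    updated[i] = False           # deactivation edge of mode i
    updated[(i + 1) % n] = True  # activation edge of the successor mode
    return updated


modes = [True, False, False]
modes = release(modes, 0)  # [False, True, False]
modes = release(modes, 1)  # [False, False, True]
modes = release(modes, 2)  # [True, False, False]
```

Because the successor index is computed by the coordinator, no participant needs to hold a reference to the next participant.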

[0230] E. Constraints

[0231] In this application, constraints are boolean relationships between control ports. They take the form:

$\mathit{Condition} \Rightarrow \mathit{Effect}$

[0232] This essentially means that the Condition (on the left side of the arrow) being true implies that Effect (on the right side of the arrow) is also true. In other words, if Condition is true, then Effect should also be true.

[0233] A constraint differs from a guarantee in that the guarantee is limited to communicating invariant relationships between components without providing a way to enforce the invariant relationship. The constraint, on the other hand, is a set of instructions to the runtime system dealing with how to enforce certain relationships between components. When a constraint is violated, two corrective actions are available to the system: (1) modify the values on the left-hand side to make the left-hand expression evaluate as false (an effect sometimes termed backpressure) or (2) alter the right-hand side to make it true. We refer to these techniques as LHM (left-hand modify) and RHM (right-hand modify). For example, given the constraint $x \Rightarrow \neg y$ and the value $x \wedge y$, with RHM semantics the runtime system must respond by disabling y or setting y to false. Thus the value of $\neg y$ is set to true.
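By way of illustration only, the two corrective actions can be sketched for a single constraint, assumed here to read "x implies not y" over two boolean control values. The function name `enforce` and the policy strings are hypothetical.

```python
def enforce(x: bool, y: bool, policy: str):
    """Enforce the assumed constraint `x implies not y` on two control values.

    "RHM" alters the right-hand side so that it becomes true (clears y).
    "LHM" alters the left-hand side so that it becomes false (clears x).
    """
    violated = x and y
    if not violated:
        return x, y
    if policy == "RHM":
        return x, False
    if policy == "LHM":
        return False, y
    raise ValueError(f"unknown policy: {policy}")


print(enforce(True, True, "RHM"))  # (True, False)
print(enforce(True, True, "LHM"))  # (False, True)
```

Under LHM the change propagates back toward whichever element asserted x, which is why the effect is sometimes termed backpressure.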

[0234] The decision of whether to use LHM, to use RHM, or even to suspend enforcement of a constraint in certain situations can dramatically affect the efficiency and predictability of the software system. Coordination-centric design does not attempt to solve simultaneous constraints at runtime. Rather, runtime algorithms use local ordered constraint solutions. This, however, can result in some constraints being violated and is discussed further below.

[0235] Round-robin coordinator 410 has a set of safety constraints to ensure that there is never more than one token in the system:

$\mathit{access}_i \Rightarrow \forall_{j \neq i}\, \neg\mathit{access}_j$

[0236] The above equation translates roughly as access_(i) implies not access_(j) for the set of all access_(j) where j is not equal to i. Even this simple constraint system can cause problems with local resolution semantics (such as LHM and RHM). If the runtime system attempted to fix all constraints simultaneously, all access modes would be shut down. If they were fixed one at a time, however, any duplicate tokens would be erased on the first pass, satisfying all other constraints and leaving a single token in the system.
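By way of illustration only, the difference between simultaneous and one-at-a-time resolution can be sketched for the safety constraint that an active access mode forces every other access mode to be false. Both function names are hypothetical, and right-hand-modify semantics are assumed in each case.

```python
def fix_simultaneously(access):
    """Resolve every constraint against the same snapshot of the modes."""
    snapshot = list(access)
    result = list(access)
    for i, active in enumerate(snapshot):
        if active:
            for j in range(len(result)):
                if j != i:
                    result[j] = False  # clear every other mode
    return result


def fix_one_at_a_time(access):
    """Resolve constraints in index order, each seeing the earlier fixes."""
    result = list(access)
    for i in range(len(result)):
        if result[i]:
            for j in range(len(result)):
                if j != i:
                    result[j] = False  # clear every other mode
    return result


tokens = [False, True, False, True]  # two tokens: a violation
print(fix_simultaneously(tokens))  # [False, False, False, False]
print(fix_one_at_a_time(tokens))   # [False, True, False, False]
```

In the simultaneous case each duplicate token clears the other, so no token survives; in the ordered case the first token found clears the rest and the remaining constraints are already satisfied.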

[0237] Since high-level protocols can be built from combinations of lower-level protocols, coordinators can be hierarchically composed. A coordinator is a six-tuple (I; M; B; N; A; X) in which:

[0238] I is a set of coordination interfaces.

[0239] M is a set of modes.

[0240] B is a set of bindings between interface elements (e.g., control ports and message ports) and internal elements (e.g., modes and triggers).

[0241] N is a set of constraints between interface elements.

[0242] A is a set of action triples for the coordinator.

[0243] X is a set of subcoordinators.

[0244] FIGS. 6A, 6B, 6C, and 6D show a few simple coordinators highlighting the bindings and constraints of the respective coordinators. With reference to FIG. 6A, a unidirectional data transfer coordinator 600 transfers data in one direction between two components (not shown) by connecting incoming receive message port 408 to outgoing receive message port 418 with a binding 602. With reference to FIG. 6B, bidirectional data transfer coordinator 604 transfers data back and forth between two components (not shown) by connecting incoming receive message port 408 to outgoing receive message port 418 with binding 602 and connecting send message port 406 to incoming send message port 416 with a second binding 602. Unidirectional data transfer coordinator 600 and bidirectional data transfer coordinator 604 simply move data from one message port to another. Thus each coordinator consists of bindings between corresponding ports on separate coordination interfaces.

[0245] With reference to FIG. 6C, state unification coordinator 606 ensures that a state port a 608 and a state port b 610 are always set to the same value. State unification coordinator 606 connects state port a 608 to state port b 610 with binding 602. With reference to FIG. 6D, control state mutex coordinator 612 has a first constraint 618 and a second constraint 620 as follows:

[0246] (1) $c \Rightarrow \neg d$ and

[0247] (2) $d \Rightarrow \neg c$.

[0248] Constraints 618 and 620 can be restated as follows:

[0249] (1) A state port c 614 having a true value implies that a state port d 616 has a false value, and

[0250] (2) State port d 616 having a true value implies that state port c 614 has a false value.

[0251] A coordinator has two types of coordination interfaces: up interfaces that connect the coordinator to a second coordinator, which is at a higher level of design hierarchy, and down interfaces that connect the coordinator either to a component or to a third coordinator, which is at a lower level of design hierarchy. Down interfaces have names preceded with “˜”. Round-robin coordinator 410 has six down coordination interfaces (previously referred to as coordinator coordination interface 412), with constraints that make the turning off of any coordinator control port 414 (also referred to as access control port) turn on the coordinator control port 414 of the next coordinator coordination interface 412 in line. Table 2 presents all constituents of the round-robin coordinator.

TABLE 2
Constituent                Value
coordination interfaces    I = ˜AccessInterface₁₋₆
modes                      M = access₁₋₆
bindings                   B = ∀_(1≤i≤6) (˜AccessInterface_(i).access, access_(i)) ∪
constraints                N = ∀_(1≤i≤6) (∀_((1≤j≤6) ∧ (i≠j)) (access_(i) ⇒ ¬access_(j)))
actions                    A = ∀_(1≤i≤6) access_(i) : −access_(i) : +access_((i+1) mod 6)
subcoordinators            X = Ø

[0252] This tuple describes an implementation of a round-robin coordination protocol for a particular system with six components, as shown in round-robin coordinator 410. We use a coordination class to describe a general coordination protocol that may not have a fixed number of coordinator coordination interfaces. The coordination class is a six-tuple (Ic; Mc; Bc; Nc; Ac; Xc) in which:

[0253] Ic is a set of coordination interface descriptors in which each descriptor provides a type of coordination interface and specifies the number of such interfaces allowed within the coordination class.

[0254] Mc is a set of abstract modes that supplies appropriate modes when a coordination class is instantiated with a fixed number of coordinator coordination interfaces.

[0255] Bc is a set of abstract bindings that forms appropriate bindings between elements when the coordination class is instantiated.

[0256] Nc is a set of abstract constraints that ensures appropriate constraints between coordination interface elements are in place as specified at instantiation.

[0257] Ac is a set of abstract action triples for the coordinator.

[0258] Xc is a set of coordination classes (hierarchy).

[0259] While a coordinator describes a coordination protocol for a particular application, it requires many aspects, such as the number of coordination interfaces and datatypes, to be fixed. Coordination classes describe protocols across many applications. The use of the coordination interface descriptors instead of coordination interfaces lets coordination classes keep the number of interfaces and datatypes undetermined until a particular coordinator is instantiated. For example, a round-robin coordinator contains a fixed number of coordinator coordination interfaces with specific bindings and constraints between the message and state ports on the fixed number of coordinator coordination interfaces. A round-robin coordination class contains descriptors for the coordinator coordination interface type, without stating how many coordinator coordination interfaces, and instructions for building bindings and constraints between ports on the coordinator coordination interfaces when a particular round-robin coordinator is created.
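By way of illustration only, instantiating a round-robin coordination class for a given number of interfaces can be sketched as a function that generates the constituents of the coordinator six-tuple. The function name, the dictionary layout, and the string encodings of bindings, constraints, and action triples are assumptions of this sketch; in particular, interface numbering is taken to run from 1 to n with the successor of interface n being interface 1.

```python
def instantiate_round_robin(n):
    """Build a coordinator description for n interfaces from class-level rules."""
    if n < 1:
        raise ValueError("need at least one coordination interface")
    interfaces = [f"~AccessInterface{i}" for i in range(1, n + 1)]
    modes = [f"access{i}" for i in range(1, n + 1)]
    # One binding per interface: its access control port to the matching mode.
    bindings = [(f"{iface}.access", mode) for iface, mode in zip(interfaces, modes)]
    # Each pair (a, b) stands for the constraint "a implies not b".
    constraints = [(a, b) for a in modes for b in modes if a != b]
    # Each triple is (mode, trigger, action) with "-" and "+" marking edges.
    actions = [
        (modes[i], f"-{modes[i]}", f"+{modes[(i + 1) % n]}") for i in range(n)
    ]
    return {"I": interfaces, "M": modes, "B": bindings,
            "N": constraints, "A": actions, "X": []}


coordinator = instantiate_round_robin(6)
print(coordinator["A"][5])  # ('access6', '-access6', '+access1')
```

The same rules yield a coordinator of any size, which is the sense in which the class leaves the number of interfaces undetermined until instantiation.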

[0260] 5. Components

[0261] A component is a six-tuple (I; A; M; V; S; X) in which:

[0262] I is a set of coordination interfaces.

[0263] A is a set of action triples.

[0264] M is a set of modes.

[0265] V is a set of typed variables.

[0266] S is a set of subcomponents.

[0267] X is a set of coordinators used to connect the subcomponents to each other and to the coordination interfaces.

[0268] Actions within a coordinator are fairly regular, and hence a large number of actions can be described with a few simple expressions. However, actions within a component are frequently diverse and can require distinct definitions for each individual action. Typically a component's action triples are represented with a table that has three columns: one for the mode, one for the trigger, and one for the action code. Table 3 shows some example actions from a component that can use round-robin coordination.

TABLE 3
Mode       Trigger    Action
access     tick       AccessInterface.s.send(“Test message”); −access;
¬access    tick       waitCount++;

[0269] A component resembles a coordinator in several ways (for example, the modes and coordination interfaces in each are virtually the same). Components can have internal coordinators, and because of the internal coordinators, components do not always require either bindings or constraints. In the following subsections, various aspects of components are described in greater detail. These aspects of components include variable scope, action transparency, and execution semantics for systems of actions.

[0270] A. Variable Scope

[0271] To enhance a component's modularity, all variables accessed by an action within the component are either local to the action, local to the immediate parent component of the action, or accessed by the immediate parent component of the action via state ports in one of the parent component's coordination interfaces. For a component's variables to be available to a hierarchical child component, they must be exported by the component and then imported by the child of the component.

[0272] B. Action Transparency

[0273] An action within a component can be either a transparent action or an opaque action. Transparent and opaque actions each have different invocation semantics. The internal properties (i.e., control structures, variables, changes in state, operators, etc.) of transparent actions are visible to all coordination-centric design tools. The design tools can separate, observe, and analyze all the internal properties of transparent actions. Opaque actions are source code. Opaque actions must be executed directly, and looking at the internal properties of opaque actions can be accomplished only through traditional, source-level debugging techniques. An opaque action must explicitly declare any mode changes and coordination interfaces that the opaque action may directly affect.

[0274] C. Action Execution

[0275] An action is triggered by an event, such as data arriving at or departing from a message port, or changes in value being applied to a state port. An action can change the value of a state port, generate an event, and provide a way for the software system to interact with low-level device drivers. Since actions typically produce events, a single trigger can be propagated through a sequence of actions.

[0276] 6. Protocols Implemented with Coordination Classes

[0277] In this section, we describe several coordinators that individually implement some common protocols: subsumption, barrier synchronization, rendezvous, and dedicated RPC.

[0278] A. Subsumption Protocol

[0279] A subsumption protocol is a priority-based, preemptive resourceallocation protocol commonly used in building small, autonomous robots,in which the shared resource is the robot itself.

[0280]FIG. 7 shows a set of coordination interfaces and a coordinatorfor implementing the subsumption protocol. With reference to FIG. 7, asubsumption coordinator 700 has a set of subsumption coordinatorcoordination interfaces 702, which have a subsume arbitrated coordinatorcontrol port 704 and an incoming subsume message port 706. Each subsumecomponent 708 has a subsume component coordination interface 710.Subsume component coordination interface 710 has a subsume arbitratedcomponent control port 712 and an outgoing subsume message port 714.Subsumption coordinator 700 and each subsume component 708 are connectedby their respective coordination interfaces, 702 and 710. Eachsubsumption coordinator coordination interface 702 in subsumptioncoordinator 700 is associated with a priority. Each subsume component708 has a behavior that can be applied to a robot (not shown). At anytime, any subsume component 708 can attempt to assert its behavior onthe robot. The asserted behavior coming from the subsume component 708connected to the subsumption coordinator coordination interface 702 withthe highest priority is the asserted behavior that will actually beperformed by the robot. Subsume components 708 need not know anythingabout other components in the system. In fact, each subsume component708 is designed to perform independently of whether their assertedbehavior is performed or ignored.

[0281] Subsumption coordinator 700 further has a slave coordinator coordination interface 716, which has an outgoing slave message port 718. Outgoing slave message port 718 is connected to an incoming slave message port 720. Incoming slave message port 720 is part of a slave coordination interface 722, which is connected to a slave 730. When a subsume component 708 asserts a behavior and that component has the highest priority, subsumption coordinator 700 will control slave 730 (which typically controls the robot) based on the asserted behavior.

[0282] The following constraint describes the basis of the subsumption coordinator 700's behavior: $subsume_{p} \Rightarrow \bigwedge_{i = 1}^{p - 1} \neg subsume_{i}$

[0283] This means that if any subsume component 708 has a subsume arbitrated component control port 712 that has a value of true, then all lower-priority subsume arbitrated component control ports 712 are set to false. An important difference between round-robin and subsumption is that in round-robin, the resource access right is transferred only when surrendered. Therefore, round-robin coordination has cooperative release semantics. However, in subsumption coordination, a subsume component 708 tries to obtain the resource whenever it needs to and succeeds only when it has higher priority than any other subsume component 708 that needs the resource at the same time. A lower-priority subsume component 708 already using the resource must surrender the resource whenever a higher-priority subsume component 708 tries to access the resource. Subsumption coordination uses preemptive release semantics, whereby each subsume component 708 must always be prepared to relinquish the resource.

[0284] Table 4 presents the complete tuple for the subsumption coordinator.

TABLE 4
Constituent                 Value
coordination interfaces     I = (Subsume_(1-n)) ∪ (Output)
modes                       M = subsume_(1-n)
bindings                    B = ∀_(1≦i≦n) (Subsume_(i).subsume, subsume_(i)) ∪
constraints                 N = ∀_(1≦i≦n) (∀_(1≦j<i) subsume_(i) ⇒ ¬subsume_(j))
actions                     A = Ø
subcoordinators             X = Ø
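By way of illustration only, the subsumption constraint can be sketched in Python. The function name arbitrate and the list representation (index equals priority, with a higher index meaning higher priority and indices starting at zero) are assumptions of this sketch and are not part of the specification.

```python
def arbitrate(requests):
    """Resolve asserted subsume modes so that subsume_p implies not subsume_i for i < p.

    requests: list of booleans, one per subsume component, ordered by priority
              (hypothetical convention: higher index = higher priority).
    Returns the confirmed modes, in which at most the highest-priority
    asserted mode remains true.
    """
    confirmed = [False] * len(requests)
    # Scan from the highest priority downward and keep the first asserted mode.
    for p in range(len(requests) - 1, -1, -1):
        if requests[p]:
            confirmed[p] = True
            break
    return confirmed
```

For example, if the components at priorities 0 and 2 both assert their behaviors, only the mode at priority 2 remains active, which reflects the preemptive release semantics described above.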

[0285] B. Barrier Synchronization Protocol

[0286] Other simple types of coordination that components might engage in enforce synchronization of activities. An example is barrier synchronization, in which each component reaches a synchronization point independently and waits. FIG. 8 depicts a barrier synchronization coordinator 800. With reference to FIG. 8, barrier synchronization coordinator 800 has a set of barrier synchronization coordination interfaces 802, each of which has a coordinator arbitrated state port 804, named wait. Coordinator arbitrated state port 804 is connected to a component arbitrated state port 806, which is part of a component coordination interface 808. Component coordination interface 808 is connected to a component 810. When all components 810 reach their respective synchronization points, they are all released from waiting. The actions for a barrier synchronization coordinator with n interfaces are: $\bigwedge_{0 \leq i < n} wait_{i} : \; : \forall_{0 \leq j < n} -wait_{j}$

[0287] In other words, when all wait modes (not shown) become active, each one is released. The blank between the two colons indicates that the trigger event is the guard condition becoming true.
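As a minimal sketch of the barrier action, assuming the wait modes are held in a plain Python list of booleans (the name barrier_step is hypothetical), the guard and its effect might look like this:

```python
def barrier_step(wait):
    """Apply the barrier action once.

    wait: list of booleans, one wait mode per connected component.
    If every wait mode is active, all are released (set to False);
    otherwise the modes are returned unchanged.
    """
    if wait and all(wait):
        return [False] * len(wait)
    return list(wait)
```

The guard is the conjunction of all wait modes, so a single component that has not yet reached its synchronization point keeps every other component waiting.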

[0288] C. Rendezvous Protocol

[0289] A resource allocation protocol similar to barrier synchronization is called rendezvous. FIG. 9 depicts a rendezvous coordinator 900 in accordance with the present invention. With reference to FIG. 9, rendezvous coordinator 900 has a rendezvous coordination interface 902, which has a rendezvous arbitrated state port 904. Each of a set of rendezvous components 906, which may perform different functions or have vastly different actions and modes, has a rendezvous component coordination interface 908, which includes a component arbitrated state port 910. Rendezvous components 906 connect to rendezvous coordinator 900 through their respective coordination interfaces, 908 and 902. Rendezvous coordinator 900 further has a rendezvous resource coordination interface 912, which has a rendezvous resource arbitrated state port 914, also called available. A resource 916 has a resource coordination interface 918, which has a resource arbitrated state port 920. Resource 916 is connected to rendezvous coordinator 900 by their complementary coordination interfaces, 918 and 912 respectively.

[0290] With rendezvous-style coordination, there are two types of participants: resource 916 and several resource users, here rendezvous components 906. When resource 916 is available, it activates its resource arbitrated state port 920, also referred to as its available control port. If there are any waiting rendezvous components 906, one will be matched with the resource; both participants are then released. This differs from subsumption and round-robin in that resource 916 plays an active role in the protocol by activating its available control port 920.

[0291] The actions for rendezvous coordinator 900 are:

available_(l) ∧ wait_(j) : : −available_(l), −wait_(j)

[0292] This could also be accompanied by other modes that indicate the status after the rendezvous. With rendezvous coordination, it is important that only one component at a time be released from wait mode.
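The rendezvous action can be sketched as follows. The function name rendezvous_step and the choice of the lowest-indexed waiting component are assumptions made for illustration; the description above does not say which waiting component is matched.

```python
def rendezvous_step(available, wait):
    """Match the resource with at most one waiting component.

    available: bool, the resource's available mode.
    wait: list of booleans, one wait mode per rendezvous component.
    Returns (available, wait, released_index), where released_index is
    None when no match takes place.
    """
    if available:
        for j, waiting in enumerate(wait):
            if waiting:
                new_wait = list(wait)
                new_wait[j] = False       # -wait_j
                return False, new_wait, j  # -available
    return available, list(wait), None
```

Only one wait mode is cleared per step, so the remaining waiting components stay blocked until the resource activates its available mode again.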

[0293] D. Dedicated RPC Protocol

[0294] A coordination class that differs from those described above is dedicated RPC. FIG. 10 depicts a dedicated RPC system. With reference to FIG. 10, a dedicated RPC coordinator 1000 has an RPC server coordination interface 1002, which includes an RPC server imported state port 1004, an RPC server output message port 1006, and an RPC server input message port 1008. Dedicated RPC coordinator 1000 is connected to a server 1010. Server 1010 has a server coordination interface 1012, which has a server exported state port 1014, a server input data port 1016, and a server output data port 1018. Dedicated RPC coordinator 1000 is connected to server 1010 through their complementary coordination interfaces, 1002 and 1012 respectively. Dedicated RPC coordinator 1000 further has an RPC client coordination interface 1020, which includes an RPC client imported state port 1022, an RPC client input message port 1024, and an RPC client output message port 1026. Dedicated RPC coordinator 1000 is connected to a client 1028 by connecting RPC client coordination interface 1020 to a complementary client coordination interface 1030. Client coordination interface 1030 has a client exported state port 1032, a client output message port 1034, and a client input message port 1036.

[0295] The dedicated RPC protocol is a client/server protocol in which server 1010 is dedicated to a single client, in this case client 1028. Unlike the resource allocation protocol examples, the temporal behavior of this protocol is the most important factor in defining it. The following transaction listing describes this temporal behavior:

[0296] Client 1028 enters blocked mode by changing the value stored at client exported state port 1032 to true.

[0297] Client 1028 transmits an argument data message to server 1010 via client output message port 1034.

[0298] Server 1010 receives the argument (labeled “a”) data message via server input data port 1016 and enters serving mode by changing the value stored in server exported state port 1014 to true.

[0299] Server 1010 computes the return value.

[0300] Server 1010 transmits a return (labeled “r”) message to client 1028 via server output data port 1018 and exits serving mode by changing the value stored in server exported state port 1014 to false.

[0301] Client 1028 receives the return data message via client input message port 1036 and exits blocked mode by changing the value stored at client exported state port 1032 to false.

[0302] This can be presented more concisely with an expression describing causal relationships: $\begin{matrix} T_{RPC} = & +client.blocked \rightarrow client.transmits \rightarrow \\ & +server.serving \rightarrow server.transmits \rightarrow \\ & (-server.serving \parallel client.receives) \rightarrow -client.blocked \end{matrix}$

[0303] The transactions above describe what is supposed to happen. Other properties of this protocol must be described with temporal logic predicates.

server.serving ⇒ client.blocked

server.serving ⇒ F(server.r.output)

server.a.input ⇒ F(server.serving)

[0304] The r in server.r.output refers to the server output data port 1018, also labeled as the r event port on the server, and the a in server.a.input refers to server input data port 1016, also labeled as the a port on the server (see FIG. 10).

[0305] Together, these predicates indicate that (1) it is an error for server 1010 to be in serving mode if client 1028 is not blocked; (2) after server 1010 enters serving mode, a response message is sent or else an error occurs; and (3) server 1010 receiving a message means that server 1010 must enter serving mode. Relationships between control state and data paths must also be considered, such as:

(client.a ⇒ client.blocked)

[0306] In other words, client 1028 must be in blocked mode whenever it sends an argument message.
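As a hedged illustration, the two state predicates (server serving implies client blocked, and an argument message implies client blocked) can be checked over a finite recorded trace. The trace format, the dictionary keys, and the function name check_trace are assumptions of this sketch; the eventuality predicates using F are not covered, because a finite trace cannot settle them in general.

```python
def check_trace(trace):
    """Return a list of (step_index, message) violations for a finite trace.

    Each step is a dict with boolean entries 'client.blocked' and
    'server.serving', plus a set 'events' naming the events observed
    at that step (hypothetical representation).
    """
    violations = []
    for k, step in enumerate(trace):
        # server.serving => client.blocked
        if step["server.serving"] and not step["client.blocked"]:
            violations.append((k, "server serving while client not blocked"))
        # client.a => client.blocked
        if "client.a" in step["events"] and not step["client.blocked"]:
            violations.append((k, "argument sent while client not blocked"))
    return violations
```

A trace that follows the six transactions listed above produces no violations, while a step in which the server is serving an unblocked client is reported.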

[0307] The first predicate takes the same form as a constraint; however, since dedicated RPC coordinator 1000 only imports the client.blocked and server.serving modes (i.e., through RPC client imported state port 1022 and RPC server imported state port 1004 respectively), dedicated RPC coordinator 1000 is not allowed to alter these values to comply. In fact, none of these predicates is explicitly enforced by a runtime system. However, the last two can be used as requirements and guarantees for interface type-checking.

[0308] 7. System-Level Execution

[0309] Coordination-centric design methodology lets system specifications be executed directly, according to the semantics described above. When components and coordinators are composed into higher-order structures, however, it becomes essential to consider hazards that can affect system behavior. Examples include conflicting constraints, in which local resolution semantics may either leave the system in an inconsistent state or make it cycle forever, and conflicting actions that undo one another's behavior. In the remainder of this section, the effect of composition issues on system-level executions is explained.

[0310] A. System Control Configurations

[0311] A configuration is the combined control state of a system, that is, the set of active modes at a point in time. In other words, a configuration in coordination-centric design is a bit vector containing one bit for each mode in the system. The bit representing a control state is true when the control state is active and false when the control state is inactive. Configurations representing the complete system control state facilitate reasoning on system properties and enable several forms of static analysis of system behavior.
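One possible encoding of a configuration as a bit vector is sketched below. The class name Configuration and its methods are hypothetical and serve only to show one bit per mode packed into a single integer.

```python
class Configuration:
    """System control state as a bit vector with one bit per mode."""

    def __init__(self, mode_names):
        # Assign each mode a fixed bit position.
        self.index = {name: i for i, name in enumerate(mode_names)}
        self.bits = 0

    def set(self, name, active):
        mask = 1 << self.index[name]
        if active:
            self.bits |= mask
        else:
            self.bits &= ~mask

    def is_active(self, name):
        return bool((self.bits >> self.index[name]) & 1)

    def active_modes(self):
        return {name for name in self.index if self.is_active(name)}
```

Because the whole control state is a single integer, two configurations can be compared or stored cheaply, which is convenient for the static analyses mentioned above.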

[0312] B. Action-Trigger Propagation

[0313] Triggers are formal parameters for events. As mentioned earlier, there are two types of triggers: (1) control triggers, invoked by control events such as mode change requests, and (2) data flow triggers, invoked by data events such as message arrivals or departures. Components and coordinators can both request mode changes (on the modes visible to them) and generate new messages (on the message ports visible to them). Using actions, these events can be propagated through the components and coordinators in the system, causing a cascade of data transmissions and mode change requests, some of which can cancel other requests. When the requests, and secondary requests implied by them, are all propagated through the system, any requests that have not been canceled are confirmed and made part of the system's new configuration.

[0314] Triggers can be immediately propagated through their respective actions or delayed by a scheduling step. Recall that component actions can be either transparent or opaque. Transparent actions typically propagate their triggers immediately, although it is not absolutely necessary that they do so. Opaque actions must typically delay propagation.

[0315] 1. Immediate Propagation

[0316] Some triggers must be immediately propagated through actions, but only through certain types of transparent actions. Immediate propagation can often involve static precomputation of the effect of changes, which means that certain actions may never actually be performed. For example, consider a system with a coordinator that has an action that activates mode A and a coordinator with an action that deactivates mode B whenever A is activated. Static analysis can be used to determine in advance that any event that activates A will also deactivate B; therefore, this effect can be executed immediately without actually propagating it through A.

[0317] 2. Delayed Propagation

[0318] Trigger propagation through opaque actions must typically be delayed, since the system cannot look into opaque actions to precompute their results. Propagation may be delayed for other reasons, such as system efficiency. For example, immediate propagation requires tight synchronization among software components. If functionality is spread among a number of architectural components, immediate propagation is impractical.

[0319] C. A Protocol Implemented with a Compound Coordinator

[0320] Multiple coordinators are typically needed in the design of a system. The multiple coordinators can be used together for a single, unified behavior. Unfortunately, one coordinator may interfere with another's behavior.

[0321] FIG. 11 shows a combined coordinator 1100 with both preemption and round-robin coordination for controlling access to a resource, as discussed above. With reference to FIG. 11, components 1102, 1104, 1106, 1108, and 1110 primarily use round-robin coordination, and each includes a component coordination interface 1112, which has a component arbitrated control port 1114 and a component output message port 1116. However, when a preemptor component 1120 needs the resource, preemptor component 1120 is allowed to grab the resource immediately. Preemptor component 1120 has a preemptor component coordination interface 1122. Preemptor component coordination interface 1122 has a preemptor arbitrated state port 1124, a preemptor output message port 1126, and a preemptor input message port 1128.

[0322] All component coordination interfaces 1112 and preemptor component coordination interface 1122 are connected to a complementary combined coordinator coordination interface 1130, which has a coordinator arbitrated state port 1132, a coordinator input message port 1134, and a coordinator output message port 1136. Combined coordinator 1100 is a hierarchical coordinator and internally has a round-robin coordinator (not shown) and a preemption coordinator (not shown). Combined coordinator coordination interface 1130 is connected to a coordination interface to round-robin 1138 and a coordination interface to preempt 1140. Coordinator arbitrated state port 1132 is bound both to a token arbitrated control port 1142, which is part of coordination interface to round-robin 1138, and to a preempt arbitrated control port 1144, which is part of coordination interface to preempt 1140. Coordinator input message port 1134 is bound to an interface to round-robin output message port 1146, and coordinator output message port 1136 is bound to an interface to round-robin input message port 1148.

[0323] Thus preemption interferes with the normal round-robin ordering of access to the resource. After a preemption-based access, the resource moves to the component that in round-robin-ordered access would be the successor to preemptor component 1120. If the resource is preempted too frequently, some components may starve.

[0324] D. Mixing Control and Data in Coordinators

[0325] Since triggers can be control-based, data-based, or both, and actions can produce both control and data events, control and dataflow aspects of a system are coupled through actions. Through combinations of actions, designers can effectively employ modal data flow, in which relative schedules are switched on and off based on the system configuration.

[0326] Relative scheduling is a form of coordination. Recognizing this and understanding how it affects a design can allow a powerful class of optimizations. Many data-centric systems (or subsystems) use conjunctive firing, which means that a component buffers messages until a firing rule is matched. When matching occurs, the component fires, consuming the messages in its buffer that caused it to fire and generating a message or messages of its own. Synchronous data flow systems are those in which all components have only firing rules with constant message consumption and generation.

[0327] FIG. 12A shows a system in which a component N1 1200 is connected to a component N3 1202 by a data transfer coordinator 1204 and a component N2 1206 is connected to component N3 1202 by a second data transfer coordinator 1208. Component N3 1202 fires when it accumulates three messages on a port c 1210 and two messages on a port d 1212. On firing, component N3 1202 produces two messages on a port o 1214. Coordination control state tracks the logical buffer depth for these components. This is shown with numbers representing the logical queue depth of each port in FIG. 12A.
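The conjunctive firing rule of component N3 1202 (three messages on port c, two on port d, two messages produced on port o) can be sketched as follows. The class name ConjunctiveComponent, its constructor arguments, and the decision to fire repeatedly inside receive are assumptions of this illustration.

```python
class ConjunctiveComponent:
    """Buffers messages per input port and fires when the rule is matched."""

    def __init__(self, needs, produces):
        if not needs or any(n <= 0 for n in needs.values()):
            raise ValueError("each input port needs a positive message count")
        self.needs = dict(needs)        # messages consumed per firing, by input port
        self.produces = dict(produces)  # messages generated per firing, by output port
        self.depth = {port: 0 for port in needs}  # logical queue depth per port

    def receive(self, port, count=1):
        """Buffer arriving messages, fire while the rule holds, return outputs."""
        self.depth[port] += count
        fired = 0
        while all(self.depth[p] >= n for p, n in self.needs.items()):
            for p, n in self.needs.items():
                self.depth[p] -= n      # consume the messages that caused the firing
            fired += 1
        return {port: n * fired for port, n in self.produces.items()}
```

With needs of {"c": 3, "d": 2} and produces of {"o": 2}, the component stays silent until the second message arrives on port d, at which point it fires once and both logical queue depths return to zero.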

[0328] FIG. 12B shows the system of FIG. 12A in which data transfer coordinator 1204 and second data transfer coordinator 1208 have been merged to form a merged data transfer coordinator 1216. Merging the coordinators in this example provides an efficient static schedule for component firing. Merged data transfer coordinator 1216 fires component N1 1200 three times and component N2 1206 twice. Merged data transfer coordinator 1216 then fires component N3 1202 twice (to consume all messages produced by component N1 1200 and component N2 1206).

[0329] Message rates can vary based on mode. For example, a component may consume two messages each time it fires in one mode and four each time it fires in a second mode. For a component like this, it is often possible to merge schedules on a configuration basis, in which each configuration has static consumption and production rates for all affected components.

[0330] E. Coordination Transformations

[0331] In specifying complete systems, designers must often specify not only the coordination between two objects, but also the intermediate mechanism they must use to implement this coordination. While this intermediate mechanism can be as simple as shared memory, it can also be another coordinator; hence coordination may be, and often is, layered. For example, RPC coordination often sits on top of a TCP/IP stack or on an IrDA stack, in which each layer coordinates with peer layers on other processing elements using unique coordination protocols. Here, each layer provides certain capabilities to the layer directly above it, and the upper layer must be implemented in terms of them.

[0332] In many cases, control and communication synthesis can be employed to automatically transform user-specified coordination to a selected set of standard protocols. Designers may have to manually produce transformations for nonstandard protocols.

[0333] F. Dynamic Behavior with Compound Coordinators

[0334] Even in statically bound systems, components may need to interact in a fashion that appears dynamic. For example, RPC-style coordination often has multiple clients for individual servers. Here, there is no apparent connection between client and server until one is forged for a transaction. After the connection is forged, however, the coordination proceeds in the same fashion as dedicated RPC.

[0335] Our approach to this is to treat the RPC server as a shared resource, requiring resource allocation protocols to control access. However, none of the resource allocation protocols described thus far would work efficiently under these circumstances. The following subsections describe an appropriate protocol for treating the RPC server as a shared resource and discuss how that protocol should be used as part of a complete multiclient RPC coordination class, one that uses the same RPC coordination interfaces described earlier.

[0336] 1. First Come/First Serve (FCFS) Protocol

[0337] FIG. 13 illustrates a first come/first serve (FCFS) resource allocation protocol, which is a protocol that allocates a shared resource to the requester that has waited longest. With reference to FIG. 13, a FCFS component interface 1300 for this protocol has a request control port 1302, an access control port 1304 and a component outgoing message port 1306. A FCFS coordinator 1308 for this protocol has a set of FCFS interfaces 1310 that are complementary to FCFS component interfaces 1300, having a FCFS coordinator request control port 1312, a FCFS coordinator access port 1314, and a FCFS coordinator input message port 1316. When a component 1318 needs to access a resource 1320, it asserts request control port 1302. When access is granted, FCFS coordinator 1308 asserts the appropriate FCFS coordinator access port 1314, releasing FCFS coordinator request control port 1312.

[0338] To do this, FCFS coordinator 1308 uses a rendezvous coordinator and two round-robin coordinators. One round-robin coordinator maintains a list of empty slots in which a component may be enqueued, and the other round-robin coordinator maintains a list showing the next component to be granted access. When an FCFS coordinator request control port 1312 becomes active, FCFS coordinator 1308 begins a rendezvous access to a binder action. When activated, this action maps the appropriate component 1318 to a position in the round-robin queues. A separate action cycles through one of the queues and selects the next component to access the server. As much as possible, FCFS coordinator 1308 attempts to grant access to resource 1320 to the earliest component 1318 having requested resource 1320, with concurrent requests determined based on the order in the rendezvous coordinator of the respective components 1318.
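The externally visible FCFS behavior can be sketched with an ordinary queue. This sketch deliberately substitutes a Python deque for the rendezvous coordinator and the two round-robin coordinators described above, so it illustrates only the ordering guarantee, not the internal structure; the class and method names are hypothetical.

```python
from collections import deque


class FcfsCoordinator:
    """Grants a single shared resource to requesters in arrival order."""

    def __init__(self):
        self.queue = deque()  # pending requesters, oldest first
        self.holder = None    # component currently granted access

    def request(self, component):
        """A component asserts its request; it is granted access when its turn comes."""
        self.queue.append(component)
        self._grant()

    def release(self, component):
        """The current holder gives up the resource."""
        if self.holder != component:
            raise ValueError("component does not hold the resource")
        self.holder = None
        self._grant()

    def _grant(self):
        # Hand the free resource to the requester that has waited longest.
        if self.holder is None and self.queue:
            self.holder = self.queue.popleft()
```

Three requests issued in the order a, b, c are served in that same order as each holder releases the resource.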

[0339] 2. Multiclient RPC

[0340] FIG. 14 depicts a multiclient RPC coordinator 1400 formed by combining FCFS coordinator 1308 with dedicated RPC coordinator 1000. With reference to FIG. 14, a set of clients 1402 have a set of client coordination interfaces 1030, as shown in FIG. 10. In addition, multiclient RPC coordinator 1400 has a set of RPC client coordination interfaces 1020, as shown in FIG. 10. For each RPC client coordination interface 1020, RPC client input message port 1024, of RPC client coordination interface 1020, is bound to the component outgoing message port 1306 of FCFS coordinator 1308. Message transfer action 1403 serves to transfer messages between RPC client input message port 1024 and component outgoing message port 1306. For coordinating the actions of multiple clients 1402, multiclient RPC coordinator 1400 must negotiate accesses to a server 1404 and keep track of the values returned by server 1404.

[0341] G. Monitor Modes and Continuations

[0342] Features such as blocking behavior and exceptions can be implemented in the coordination-centric design methodology with the aid of monitor modes. Monitor modes are modes that exclude all but a selected set of actions called continuations, which are actions that continue a behavior started by another action.

[0343] 1. Blocking Behavior

[0344] With blocking behavior, one action releases control while entering a monitor mode, and a continuation resumes execution after the anticipated response event. Monitor mode entry must be immediate (at least locally), so that no unexpected actions can execute before they are blocked by such a mode.

[0345] Each monitor mode has a list of actions that cannot be executed when it is entered. The allowed (unlisted) actions are either irrelevant or are continuations of the action that caused entry into this mode. There are other conditions as well. A monitor mode requires an exception action if it is forced to exit. However, this exception action is not executed if the monitor mode is turned off locally.
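A minimal sketch of these rules, assuming a monitor mode is represented by a set of blocked action names plus a callable exception action (the class MonitorMode and its methods are hypothetical):

```python
class MonitorMode:
    """A mode that blocks listed actions and raises an exception action on forced exit."""

    def __init__(self, blocked_actions, exception_action):
        self.blocked = set(blocked_actions)       # actions that cannot run in this mode
        self.exception_action = exception_action  # called when the mode is forced to exit
        self.active = False

    def enter(self):
        self.active = True

    def allows(self, action_name):
        """Unlisted actions (irrelevant ones and continuations) remain allowed."""
        return not (self.active and action_name in self.blocked)

    def exit(self, local):
        """Leave the mode; the exception action runs only on a forced (non-local) exit."""
        was_active = self.active
        self.active = False
        if was_active and not local:
            self.exception_action()
```

A continuation that turns the mode off locally therefore completes the blocked behavior quietly, whereas an exit forced from elsewhere triggers the exception action.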

[0346] When components are distributed over a number of processing elements, it is not practical to assume complete synchronization of the control state. In fact, there are a number of synchronization options available, as detailed in Chou, P., “Control Composition and Synthesis of Distributed Real-Time Embedded Systems,” Ph.D. dissertation, University of Washington, 1998.

[0347] 2. Exception Handling

[0348] Exception actions are a type of continuation. When in a monitor mode, exception actions respond to unexpected events or events that signal error conditions. For example, multiclient RPC coordinator 1400 can bind −client.blocked to a monitor mode and set an exception action on +server.serving. This will signal an error whenever the server begins to work when the client is not blocked for a response.

[0349] 8. A Complete System Example

[0350] FIG. 15 depicts a large-scale example system under the coordination-centric design methodology. With reference to FIG. 15, the large-scale system is a bimodal digital cellular network 1500. Network 1500 is for the most part a simplified version of a GSM (global system for mobile communications) cellular network. This example shows in greater detail how the parts of coordination-centric design work together and demonstrates a practical application of the methodology. Network 1500 has two different types of cells, a surface cell 1502 (also referred to as a base station 1502) and a satellite cell 1504. These cells are differentiated not only by physical position, but also by the technologies they use to share network 1500. Satellite cells 1504 use a code division multiple access (CDMA) technology, and surface cells 1502 use a time division multiple access (TDMA) technology. Typically, there are seven frequency bands reserved for TDMA and one band reserved for CDMA. The goal is for as much communication as possible to be conducted through the smaller TDMA cells, here surface cells 1502, because power requirements for a CDMA cell, here satellite cell 1504, increase with the number of users in the CDMA cell. Mobile units 1506, or wireless devices, can move between surface cells 1502, requiring horizontal handoffs between surface cells 1502. Several surface cells 1502 are typically connected to a switching center 1508. Switching center 1508 is typically connected to a telephone network or the Internet 1512. In addition to handoffs between surface cells 1502, the network must be able to hand off between switching centers 1508. When mobile units 1506 leave the TDMA region, they remain covered by satellite cells 1504 via vertical handoffs between cells. Since vertical handoffs require changing protocols as well as changing base stations and switching centers, they can be complicated in terms of control.

[0351] Numerous embedded systems comprise the overall system. For example, switching center 1508 and base stations, surface cells 1502, are required as part of the network infrastructure, but cellular phones, handheld Web browsers, and other mobile units 1506 may be supported for access through network 1500. This section concentrates on the software systems for two particular mobile units 1506: a simple digital cellular phone (shown in FIG. 16) and a handheld Web browser (shown in FIG. 24). These examples require a wide variety of coordinators and reusable components. Layered coordination is a feature in each system, because a function of many subsystems is to perform a layered protocol. Furthermore, this example displays how the hierarchically constructed components can be applied in a realistic system to help manage the complexity of the overall design.

[0352] To begin this discussion, we describe the cellular phone in detail, focusing on its functional components and the formalization of their interaction protocols. We then discuss the handheld Web browser in less detail but highlight the main ways in which its functionality and coordination differ from those of the cellular phone. In describing the cellular phone, we use a top-down approach to show how a coherent system organization is preserved, even at a high level. In describing the handheld Web browser, we use a bottom-up approach to illustrate component reuse and bottom-up design.

[0353] A. Cellular Phone

[0354] FIG. 16 shows a top-level coordination diagram of the behavior of a cell phone 1600. Rather than using a single coordinator that integrates the components under a single protocol, we use several coordinators in concert. Interactions between coordinators occur mainly within the components to which they connect.

[0355] With reference to FIG. 16, cell phone 1600 supports digital encoding of voice streams. Before it can be used, it must be authenticated with a home master switching center (not shown). This authentication occurs through a registered master switch for each phone and an authentication number from the phone itself. There are various authentication statuses, such as full access, grey-listed, or blacklisted. For cell phone 1600, real-time performance is more important than reliability. A dropped packet is not retransmitted, and a late packet is dropped since its omission degrades the signal less than its late incorporation.

[0356] Each component of cell phone 1600 is hierarchical. A GUI 1602 lets users enter phone numbers, displays the numbers as they are entered, and lets users query an address book 1604 and a logs component 1606. Address book 1604 is a database that can map names to phone numbers and vice versa. GUI 1602 uses address book 1604 to help identify callers and to look up phone numbers to be dialed. Logs 1606 track both incoming and outgoing calls as they are dialed. A voice component 1608 digitally encodes and decodes, and compresses and decompresses, an audio signal. A connection component 1610 multiplexes, transmits, receives, and demultiplexes the radio signal and separates out the voice stream and caller identification information.

[0357] Coordination among the above components makes use of several of the coordinators discussed above. Between connection component 1610 and a clock 1612, and between logs 1606 and connection component 1610, are unidirectional data transfer coordinators 600 as described with reference to FIG. 6A. Between voice component 1608 and connection component 1610, and between GUI 1602 and connection component 1610, are bidirectional data transfer coordinators 604, as described with reference to FIG. 6B. Between clock 1612 and GUI 1602 is a state unification coordinator 606, as described with reference to FIG. 6C. Between GUI 1602 and address book 1604 is a dedicated RPC coordinator 1000 as described with reference to FIG. 10, in which address book 1604 has client 1028 and GUI 1602 has server 1010.

[0358] There is also a custom GUI/log coordinator 1614 between logs 1606 and GUI 1602. GUI/log coordinator 1614 lets GUI 1602 transfer new logged information through an r output message port 1616 on a GUI coordination interface 1618 to an r input message port 1620 on a log coordination interface 1622. GUI/log coordinator 1614 also lets GUI 1602 choose current log entries through a pair of c output message ports 1624 on GUI coordination interface 1618 and a pair of c input message ports 1626 on log coordination interface 1622. Logs 1606 continuously display one entry each for incoming and outgoing calls.

[0359] 1. GUI Component

[0360] FIG. 17A is a detailed view of GUI component 1602, of FIG. 16. With reference to FIG. 17A, GUI component 1602 has two inner components, a keypad 1700 and a text-based liquid crystal display 1702, as well as several functions of its own (not shown). Each time a key press occurs, it triggers an action that interprets the press, depending on the mode of the system. Numeric presses enter values into a shared dialing buffer. When a complete number is entered, the contents of this buffer are used to establish a new connection through connection component 1610. Table 5 shows the action triples for GUI 1602.

TABLE 5
Mode          Trigger       Action
Idle                        numBuffer.append(keypress.val)
              Send          radio.send(numBuffer.val) +outgoingCall
              Disconnect    Nil
              Leftarrow     AddressBook.forward() + lookupMode
              Rightarrow    log.lastcall() + outlog
LookupMode    Leftarrow     AddressBook.forward()
              Rightarrow    AddressBook.backward()
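The mode-dependent interpretation of key presses can be illustrated with a lookup keyed on (mode, trigger). The dictionary ACTION_TRIPLES and the function select_action are hypothetical names; the action strings are copied from Table 5 and are not executed, and the row whose trigger is not legible in the table is omitted.

```python
# (mode, trigger) -> action text, as listed in Table 5.
ACTION_TRIPLES = {
    ("Idle", "Send"): "radio.send(numBuffer.val) +outgoingCall",
    ("Idle", "Disconnect"): "Nil",
    ("Idle", "Leftarrow"): "AddressBook.forward() + lookupMode",
    ("Idle", "Rightarrow"): "log.lastcall() + outlog",
    ("LookupMode", "Leftarrow"): "AddressBook.forward()",
    ("LookupMode", "Rightarrow"): "AddressBook.backward()",
}


def select_action(mode, trigger):
    """Return the action bound to a trigger in the given mode, or None if unbound."""
    return ACTION_TRIPLES.get((mode, trigger))
```

The same Rightarrow press thus selects log.lastcall() in Idle mode but AddressBook.backward() in LookupMode, which is the sense in which an action interprets the press depending on the mode of the system.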

[0361] An “Addr Coord” coordinator 1704 includes an address book mode (not shown) in which arrow key presses are transformed into RPC calls.

[0362] 2. Logs Component

[0363] FIG. 17B is a detailed view of logs component 1606, which tracks all incoming and outgoing calls. With reference to FIG. 17B, both GUI component 1602 and connection component 1610 must communicate with logs component 1606 through specific message ports. Those specific message ports include a transmitted number message port 1720, a received number message port 1722, a change current received message port 1724, a change current transmitted message port 1726, and two state ports 1728 and 1729 for presenting the current received and current transmitted values, respectively.

[0364] Logs component 1606 contains two identical single-log components: a send log 1730 for outgoing calls and a receive log 1740 for incoming calls. The interface of logs component 1606 is connected to the individual log components by a pair of adapter coordinators, Adap1 1750 and Adap2 1752. Adap1 1750 has an adapter receive interface 1754, which has a receive imported state port 1756 and a receive output message port 1758. Adap1 1750 further has an adapter send interface 1760, which has a send imported state port 1762 and a send output message port 1764. Within Adap1, state port 1728 is bound to receive imported state port 1756, change current received message port 1724 is bound to receive output message port 1758, received number message port 1722 is bound to a received interface output message port 1766 on a received number coordination interface 1768, change current transmitted message port 1726 is bound to send output message port 1764, and state port 1729 is bound to send imported state port 1762.

[0365] 3. Voice Component

[0366] FIG. 18A is a detailed view of voice component 1608 of FIG. 16. Voice component 1608 has a compression component 1800 for compressing digitized voice signals before transmission, a decompression component 1802 for decompressing received digitized voice signals, and interfaces 1804 and 1806 to analog transducers (not shown) for digitizing sound to be transmitted and for converting received transmissions into sound. Voice component 1608 is a pure data flow component containing a sound generator 1808, which functions as both a white-noise generator and a ring tone generator and has a separate port for each on sound generator interface 1810, together with voice compression functionality in the form of compression component 1800 and decompression component 1802.

[0367] 4. Connection Component

[0368] FIG. 18B is a detailed view of connection component 1610 of FIG. 16. With reference to FIG. 18B, connection component 1610 coordinates with voice component 1608, logs component 1606, clock 1612, and GUI 1602. In addition, connection component 1610 is responsible for coordinating the behavior of cell phone 1600 with a base station that owns the surface cell 1502 (shown in FIG. 15), a switching center 1508 (shown in FIG. 15), and all other phones (not shown) within surface cell 1502. Connection component 1610 must authenticate users, establish connections, and perform handoffs as needed, including appropriate changes in any low-level protocols (such as a switch from TDMA to CDMA).

[0369] FIG. 19 depicts a set of communication layers between connection component 1610 of cell phone 1600 and base station 1502 or switching center 1508. With reference to FIG. 19, connection component 1610 has several subcomponents, or lower-level components, each of which coordinates with an equivalent, or peer, layer on either base station 1502 or switching center 1508. The subcomponents of connection component 1610 include a cell phone call manager 1900, a cell phone mobility manager 1902, a cell phone radio resource manager 1904, a cell phone link protocol manager 1906, and a cell phone transport manager 1908, which is responsible for coordinating access to, and transferring data through, the shared airwaves using TDMA and CDMA coordination. Each subcomponent is described in detail below, including how each fits into the complete system.

[0370] Base station 1502 has a call management coordinator 1910, a mobility management coordinator 1912, a radio resource coordinator 1914 (BSSMAP 1915), a link protocol coordinator 1916 (SCCP 1917), and a transport coordinator 1918 (MTP 1919). Switching center 1508 has a switching center call manager 1920, a switching center mobility manager 1922, a BSSMAP 1924, an SCCP 1926, and an MTP 1928.

[0371] a. Call Management

[0372] FIG. 20 is a detailed view of a call management layer 2000 consisting of cell phone call manager 1900, which is connected to switching center call manager 1920 by call management coordinator 1910. With reference to FIG. 20, call management layer 2000 coordinates the connection between cell phone 1600 and switching center 1508. Call management layer 2000 is responsible for dialing, paging, and talking. Call management layer 2000 is always present in cell phone 1600, though not necessarily in Internet appliances (discussed later). Cell phone call manager 1900 includes a set of modes (not shown) for call management coordination that consists of the following modes:

[0373] Standby

[0374] Dialing

[0375] RingingRemote

[0376] Ringing

[0377] CallInProgress

[0378] Cell phone call manager 1900 has a cell phone call manager interface 2002. Cell phone call manager interface 2002 has a port corresponding to each of the above modes. The standby mode is bound to a standby exported state port 2010. The dialing mode is bound to a dialing exported state port 2012. The RingingRemote mode is bound to a RingingRemote imported state port 2014. The Ringing mode is bound to a ringing imported state port 2016. The CallInProgress mode is bound to a CallInProgress arbitrated state port 2018.

[0379] Switching center call manager 1920 includes the following modes (not shown) for call management coordination at the switching center:

[0380] Dialing

[0381] RingingRemote

[0382] Paging

[0383] CallInProgress

[0384] Switching center call manager 1920 has a switching center call manager coordination interface 2040, which includes a port for each of the above modes within switching center call manager 1920.

[0385] When cell phone 1600 requests a connection, switching center 1508 creates a new switching center call manager and establishes a call management coordinator 1910 between cell phone 1600 and switching center call manager 1920.

[0386] b. Mobility Management

[0387] A mobility management layer authenticates mobile unit 1506 or cell phone 1600. When there is a surface cell 1502 available, mobility manager 1902 contacts the switching center 1508 for surface cell 1502 and transfers a mobile unit identifier (not shown) for mobile unit 1506 to switching center 1508. Switching center 1508 then looks up a home mobile switching center for mobile unit 1506 and establishes a set of permissions assigned to mobile unit 1506. This layer also acts as a conduit for the call management layer. In addition, the mobility management layer performs handoffs between base stations 1502 and switching centers 1508 based on information received from the radio resource layer.

[0388] c. Radio Resource

[0389] In the radio resource layer, radio resource manager 1904 chooses the target base station 1502 and tracks changes in frequencies, time slices, and CDMA codes. Cell phones may negotiate with up to 16 base stations simultaneously. This layer also identifies when handoffs are necessary.

[0390] d. Link Protocol

[0391] The link layer manages a connection between cell phone 1600 and base station 1502. In this layer, link protocol manager 1906 packages data for transfer to base station 1502 from cell phone 1600.

[0392] e. Transport

[0393] FIG. 21A is a detailed view of transport component 1908 of connection component 1610. Transport component 1908 has two subcomponents, a receive component 2100 for receiving data and a transmit component 2102 for transmitting data. Each of these subcomponents has two parallel data paths, a CDMA path 2104 and a TDMA/FDMA path 2106, for communicating in the respective network protocols.

[0394] FIG. 21B is a detailed view of a CDMA modulator 2150, which implements a synchronous data flow data path. CDMA modulator 2150 takes the dot-product of an incoming data signal along path 2152 and a stored modulation code for cell phone 1600 along path 2154. The modulation code is a sequence of chips, which are measured time signals having a value of −1 or +1.

[0395] Transport component 1908 uses CDMA and TDMA technologies to coordinate access to a resource shared among several cell phones 1600, i.e., the airwaves. Transport components 1908 supersede the FDMA technologies (e.g., AM and FM) used for analog cellular phones and for radio and television broadcasts. In FDMA, a signal is encoded for transmission by modulating it with a carrier frequency. A signal is decoded by demodulation after being passed through a band pass filter to remove other carrier frequencies. Each base station 1502 has a set of frequencies, chosen to minimize interference between adjacent cells. (The area covered by a cell may be much smaller than the net range of the transmitters within it.)

[0396] TDMA, on the other hand, coordinates access to the airwaves through time slicing. Each cell phone 1600 on the network is assigned a small time slice, during which it has exclusive access to the media. Outside of the small time slice, cell phone 1600 must remain silent. Decoding is performed by filtering out all signals outside of the small time slice. The control for this access must be distributed. As such, each component involved must be synchronized to observe the start and end of the small time slice at the same instant.

[0397] Most TDMA systems also employ FDMA, so that instead of sharing a single frequency channel, cell phones 1600 share several channels. The band allocated to TDMA is broken into frequency channels, each with a carrier frequency and a reasonable separation between channels. Thus user channels for the most common implementations of TDMA can be represented as a two-dimensional array, in which the rows represent frequency channels and the columns represent time slices.
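The two-dimensional view of user channels can be sketched as a small grid. This is an illustrative sketch only: the function names, the grid size, and the first-free assignment policy are assumptions and are not taken from the disclosure.

```python
def make_channel_grid(num_freq_channels, num_time_slices):
    """Rows are frequency channels, columns are time slices; None marks a free user channel."""
    return [[None] * num_time_slices for _ in range(num_freq_channels)]


def assign_channel(grid, phone_id):
    """Give phone_id the first free (frequency, time slice) cell, or return None if the grid is full."""
    for freq, row in enumerate(grid):
        for time_slice, owner in enumerate(row):
            if owner is None:
                row[time_slice] = phone_id
                return (freq, time_slice)
    return None


def may_transmit(grid, phone_id, freq, time_slice):
    """A phone may transmit only on its own frequency channel during its own time slice."""
    return grid[freq][time_slice] == phone_id


grid = make_channel_grid(num_freq_channels=2, num_time_slices=3)
slots = [assign_channel(grid, phone) for phone in ["phone1", "phone2", "phone3", "phone4"]]
```

Here the fourth phone spills over onto the second frequency channel once the three time slices of the first channel are taken.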

[0398] CDMA is based on vector arithmetic. In a sense, CDMA performs inter-cell-phone coordination using data flow. Instead of breaking up the band into frequency channels and time slicing these, CDMA regards the entire band as an n-dimensional vector space. Each channel is a code that represents a basis vector in this space. Bits in the signal are represented as either 1 or −1, and the modulation is the inner product of this signal and a basis vector of mobile unit 1506 or cell phone 1600. This process is called spreading, since it effectively takes a narrowband signal and converts it into a broadband signal.

[0399] Demultiplexing is simply a matter of taking the dot-product of the received signal with the appropriate basis vector, obtaining the original 1 or −1. With fast computation and the appropriate codes or basis vectors, the signal can be modulated without a carrier frequency. If this is not the case, a carrier and analog techniques can be used to fill in where computation fails. If a carrier is used, however, all units use the same carrier in all cells.

[0400] FIG. 22 shows TDMA and CDMA signals for four cell phones 1600. With reference to FIG. 22, for TDMA, each cell phone 1600 is assigned a time slice during which it can transmit. Cell phone 1 is assigned time slice t0, cell phone 2 is assigned time slice t1, cell phone 3 is assigned time slice t2, and cell phone 4 is assigned time slice t3. For CDMA, each cell phone 1600 is assigned a basis vector that it multiplies with its signal. Cell phone 1 is assigned the vector: $\begin{pmatrix} -1 \\ 1 \\ -1 \\ 1 \end{pmatrix}$

[0401] Cell phone 2 is assigned the vector: $\begin{pmatrix} 1 \\ -1 \\ 1 \\ -1 \end{pmatrix}$

[0402] Cell phone 3 is assigned the vector: $\begin{pmatrix} 1 \\ 1 \\ -1 \\ -1 \end{pmatrix}$

[0403] Cell phone 4 is assigned the vector: $\begin{pmatrix} -1 \\ -1 \\ 1 \\ 1 \end{pmatrix}$

[0404] Notice that demultiplexing relies on these vectors forming an orthogonal basis, that is, on the dot-product of any two distinct vectors being zero.
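Spreading and demultiplexing can be illustrated with a short sketch. The codes below are the rows of a 4×4 Walsh-Hadamard matrix, an assumption made for illustration so that every pair of distinct codes has a zero dot-product; they are not the vectors of FIG. 22. Spreading is modeled here as multiplying one bit by every chip of the code, and the names `CODES`, `spread`, `combine`, and `despread` are hypothetical.

```python
# Hypothetical codes: rows of a 4x4 Walsh-Hadamard matrix (mutually orthogonal).
CODES = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]


def dot(u, v):
    """Dot-product of two equal-length chip sequences."""
    return sum(a * b for a, b in zip(u, v))


def spread(bit, code):
    """Spread one bit (+1 or -1) over the chips of a code."""
    return [bit * chip for chip in code]


def combine(signals):
    """Model the shared airwaves as the chip-by-chip sum of all transmitted signals."""
    return [sum(chips) for chips in zip(*signals)]


def despread(received, code):
    """Recover one phone's bit: dot-product with its code, normalized by the code length."""
    return dot(received, code) // len(code)


bits = [1, -1, -1, 1]    # one bit per cell phone
channel = combine([spread(bit, code) for bit, code in zip(bits, CODES)])
recovered = [despread(channel, code) for code in CODES]
```

Because the codes are mutually orthogonal, the contributions of the other three phones cancel in each dot-product, so `recovered` equals `bits`.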

[0405] B. Handheld Web Browser

[0406] In the previous subsection, we demonstrated our methodology on a cell phone with a top-down design approach. In this subsection, we demonstrate our methodology with a bottom-up approach in building a handheld Web browser.

[0407] FIG. 23A shows an LCD touch screen component 2300 for a Web browser GUI (shown in FIG. 24A) for a wireless device 1506. With reference to FIG. 23A, LCD touch screen component 2300 has an LCD screen 2302 and a touch pad 2304.

[0408] FIG. 23B shows a Web page access component 2350 for fetching and formatting web pages. With reference to FIG. 23B, web access component 2350 has a page fetch subcomponent 2352 and a page format subcomponent 2354. Web access component 2350 reads hypertext markup language (HTML) from a connection interface 2356, sends word placement requests to a display interface 2358, and sends image requests to the connection interface 2356. Web access component 2350 also has a character input interface to allow users to enter page requests directly and to fill out forms on pages that have forms.

[0409] FIG. 24A shows a completed handheld Web browser GUI 2400. With reference to FIG. 24A, handheld Web browser GUI 2400 has LCD touch screen component 2300, web access component 2350, and a pen stroke recognition component 2402 that translates pen strokes entered on touch pad 2304 into characters.

[0410] FIG. 24B shows the complete component view of a handheld Web browser 2450. With reference to FIG. 24B, handheld Web browser 2450 is formed by connecting handheld Web browser GUI 2400 to connection component 1610 of cell phone 1600 (described with reference to FIG. 16) with bidirectional data transfer coordinator 604 (described with reference to FIG. 6B). Handheld Web browser 2450 is an example of mobile unit 1506, and connects to the Internet through the cellular infrastructure described above. However, handheld Web browser 2450 has different access requirements than does cell phone 1600. For handheld Web browser 2450, reliability is more important than real-time delivery. Dropped packets usually require retransmission, so it is better to deliver a packet late than to drop it. Real-time issues primarily affect download time and are therefore secondary. Despite this, handheld Web browser 2450 must coordinate media access with cell phones 1600, and so it must use the same protocol as cell phones 1600 to connect to the network. For that reason, handheld Web browser 2450 can reuse connection component 1610 from cell phone 1600.

[0411] Debugging Techniques

[0412] In concept, debugging is a simple process. A designer locates the cause of undesired behavior in a system and fixes the cause. In practice, debugging, even of sequential software, remains difficult. Embedded systems are considerably more complicated to debug than sequential software, due to factors such as concurrence, distributed architectures, and real-time concerns. Issues taken for granted in sequential software, like a schedule that determines the order of all events (the program), are nonexistent in a typical distributed system. Locating and fixing bugs in these complex systems depends on many factors, including an understanding of the thought processes underpinning the design.

[0413] Prior art research into debugging distributed systems is diverse and eclectic and lacks any standard notations. This application uses a standardized notation to describe both the prior art and the present invention. As a result of this standardized notation, the principles in the prior art follow those published in the referenced works. However, the specific notation, theorems, etc., may differ.

[0414] The two general classes of debugging techniques are event-based debugging and state-based debugging. Most debugging techniques for general-purpose distributed systems are event based. Event-based debugging techniques operate by collecting event traces from individual system components and causally relating those event traces. These techniques require an ability to determine efficiently the causal ordering among any given pair of events. Determining the causal order can be difficult and costly.

[0415] Events may be primitive, or they may be hierarchical clusters of other events. Primitive events are abstractions of individual local occurrences that might be important to a debugger. Examples of primitive events in sequential programs are variable assignments and subroutine entries or returns. Primitive events for distributed systems include message send and receive events.

[0416] State-based debugging techniques are less commonly used in debugging distributed systems. State-based debugging techniques typically operate by presenting designers with views or snapshots of a process state. Distributed systems are not tightly synchronized, and so these techniques traditionally involve only the state of individual processes. However, state-based debugging techniques can be applied more generally by relaxing the concept of an “instant in time” so that it can be effectively applied to asynchronous processes.

[0417] 1. Event-Based Debugging

[0418] In this section, prior art systems for finding and tracking meaningful event orderings, despite limits in observation, are described. Typical ways in which event orderings are used in visualization tools through automated space/time diagrams are then described.

[0419] A. Event Order Determination and Observation

[0420] The behavior of a software system is determined by the events that occur and the order in which they occur. For sequential systems, this seems almost too trivial to mention; of course, a given set of events, such as

{x:=2, x:=x*2, x:=5, y:=x},

[0421] arranged in two different ways may describe two completely different behaviors. However, since a sequential program is essentially a complete schedule of events, ordering is explicit. Sequential debugging tools depend on the invariance of this event schedule to let programmers reproduce failures by simply using the same inputs. In distributed systems, as in any concurrent system, it is neither practical nor efficient to completely schedule all events. Concurrent systems typically must be designed with flexible event ordering.
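As a brief illustration, the four events above can be executed under two different schedules. This is a sketch only; the helper names and the assumption that x and y both start at 0 are not part of the original text.

```python
def run(schedule):
    """Apply a complete schedule of events to a fresh state (x and y assumed to start at 0)."""
    state = {"x": 0, "y": 0}
    for event in schedule:
        event(state)
    return state


def x_gets_2(state):
    state["x"] = 2


def x_doubles(state):
    state["x"] = state["x"] * 2


def x_gets_5(state):
    state["x"] = 5


def y_gets_x(state):
    state["y"] = state["x"]


# The same set of events, arranged in two different ways.
first = run([x_gets_2, x_doubles, x_gets_5, y_gets_x])
second = run([x_gets_2, x_gets_5, x_doubles, y_gets_x])
```

The first schedule ends with x = 5 and y = 5, while the second ends with x = 10 and y = 10, even though both contain exactly the same events.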

[0422] Determining the order in which events occur in a distributed system is subject to the limits of observation. An observation is an event record collected by an observer. An observer is an entity that watches the progress of an execution and records events but does not interfere with the system. To determine the order in which two events occur, an observer must measure them both against a common reference.

[0423] FIG. 25 shows a typical space/time diagram 2500, with space represented on a vertical axis 2502 and time represented on a horizontal axis 2504. With reference to FIG. 25, space/time diagram 2500 provides a starting point for discussing executions in distributed systems. Space/time diagram 2500 gives us a visual representation for discussing event ordering and for comparing various styles of observation. A set of horizontal world lines 2506, 2508, and 2510 each represent an entity that is stationary in space. The entities represented by horizontal world lines 2506, 2508, and 2510 are called processes and typically represent software processes in the subject system. The entities can also represent any entity that generates events in a sequential fashion. The spatial separation in the diagram, along vertical axis 2502, represents a virtual space, since several processes might execute on the same physical hardware. A diagonal world line 2512 is called a message and represents discrete communications that pass between two processes. A sphere 2514 represents an event. In subsequent figures vertical axis 2502 and horizontal axis 2504 are omitted from any space/time diagrams, unless vertical axis 2502 and horizontal axis 2504 provide additional clarity to a particular figure.

[0424] FIG. 26 shows a space/time diagram 2600 of two different observations of a single system execution, taken by a first observer 2602 and a second observer 2604. With reference to FIG. 26, first observer 2602 and second observer 2604 are entities that record event occurrence. First observer 2602 and second observer 2604 must each receive distinct notifications of each event that occurs and each must record the events in some total order. First observer 2602 and second observer 2604 are represented in space/time diagram 2600 as additional processes, or horizontal world lines. Each event recorded requires a signal from its respective process to both first observer 2602 and second observer 2604. The signals from an event x 2606 on a process 2608 to both first observer 2602 and second observer 2604 are embodied in messages 2610 and 2612, respectively. First observer 2602 records event x 2606 as preceding an event y 2614. However, second observer 2604 records event y 2614 as preceding event x 2606. Such effects may be caused by nonuniform latencies within the system.

[0425] However, the observations of first observer 2602 and second observer 2604 are not equally valid. A valid observation is typically an observation that preserves the order of events that depend on each other. Second observer 2604 records the receipt of a message 2616 before that message is transmitted. Thus the observation from second observer 2604 is not valid.

[0426] FIG. 27 shows a space/time diagram 2700 for a special, ideal observer, called the real-time observer (RTO) 2702. With reference to FIG. 27, RTO 2702 can view each event immediately as it occurs. Due to the limitations of physical clocks, and efficiency issues in employing them, it is usually not practical to implement RTO 2702. However, RTO 2702 represents an upper bound on precision in event-order determination.

[0427] FIG. 28 shows a space/time graph 2800 showing two valid observations of a system taken by two separate observers: RTO 2702 and a third observer 2802. With reference to FIG. 28, there is nothing special about the ordering of the observation taken by RTO 2702. Events d 2804, e 2806, and f 2808 are all independent events in this execution. Therefore, the observation produced by RTO 2702 and the observation produced by third observer 2802 can each be used to reproduce equivalent executions of the system. Any observation in which event dependencies are preserved is typically equal in value to an observation by RTO 2702. However, real-time distributed systems may need additional processes to emulate timing constraints.

[0428] FIG. 29 is a space/time diagram 2900 of a methodological observer, called the discrete Lamport Observer (DLO) 2902, that records each event in a set of ordered bins. With reference to FIG. 29, DLO 2902 records an event 2904 in an ordered bin 2906 based on the following rule: each event is recorded in the leftmost bin that follows all events on which it depends. DLO 2902 views events discretely and does not need a clock. DLO 2902 does, however, require explicit knowledge of event dependency. To determine the bin in which each event must be placed, DLO 2902 needs to know the bins of the immediately preceding events. The observation produced by DLO 2902 is also referred to as a topological sort of the system execution's event graph.
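The binning rule can be sketched as follows. This is a minimal illustration under stated assumptions: bins are numbered from 0, the event names and the `predecessors` mapping are hypothetical, and the dependencies are assumed to be acyclic.

```python
def assign_bins(predecessors):
    """Place each event in the leftmost bin that follows all events on which it depends.

    `predecessors` maps each event to the list of events that immediately precede it.
    """
    bins = {}

    def bin_of(event):
        if event not in bins:
            # One bin to the right of the latest predecessor; bin 0 if there is none.
            bins[event] = 1 + max((bin_of(p) for p in predecessors[event]), default=-1)
        return bins[event]

    for event in predecessors:
        bin_of(event)
    return bins


# Hypothetical execution: process A runs a1, a2, a3; process B runs b1, b2;
# a2 sends a message that b2 receives. Records are listed out of order on purpose.
predecessors = {
    "b2": ["b1", "a2"],
    "a3": ["a2"],
    "a1": [],
    "b1": [],
    "a2": ["a1"],
}
bins = assign_bins(predecessors)
```

Events a1 and b1 share bin 0, a2 falls in bin 1, and both a3 and b2 fall in bin 2, regardless of the order in which the records are supplied.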

[0429] In the following, E is the set of all events in an execution. The immediate predecessor relation, → ⊆ E×E, includes all pairs (e_(a), e_(b)) such that either:

[0430] a) If e_(a) and e_(b) are on the same process, e_(a) precedes e_(b) with no intermediate events.

[0431] b) If e_(b) is a receive event, e_(a) is the send event that generated the message. Given these conditions, e_(a) is called the immediate predecessor of e_(b).

[0432] Each event has at most two immediate predecessors. Therefore, DLO 2902 need only find the bins of at most two records before each placement. The transitive closure of the immediate predecessor relation forms a causal relation. The causal relation, ⇝ ⊆ E×E, is the smallest transitive relation such that e_(i) → e_(j) ⇒ e_(i) ⇝ e_(j), where e_(i) → e_(j) denotes that e_(i) is an immediate predecessor of e_(j).

[0433] This relation defines a partial order of events and further limits the definition of a valid observation. A valid observation is an ordered record of events from a given execution, i.e., (R, <), where e ∈ E ⇒ record(e) ∈ R and < is an ordering operator. With e_(i) ⇝ e_(j) denoting that e_(i) causally precedes e_(j), a valid observation has:

∀ e_(i), e_(j) ∈ E: e_(i) ⇝ e_(j) ⇒ record(e_(i)) < record(e_(j))

[0434] The dual of the causal relation is a concurrence relation. The concurrence relation, ∥ ⊆ E×E, includes all pairs (e_(a), e_(b)) such that neither e_(a) causally precedes e_(b) (e_(a) ⇝ e_(b)) nor e_(b) causally precedes e_(a) (e_(b) ⇝ e_(a)). While the causal relation is transitive, the concurrence relation is not. The concurrence relation is symmetric, while the causal relation is not.
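The causal relation, the validity condition on observations, and the concurrence relation can be sketched together. The function names and the small example execution are assumptions introduced for illustration; the closure is computed by naive repeated composition, which is adequate only for small examples.

```python
def causal_closure(immediate):
    """Transitive closure of a set of (earlier, later) immediate-predecessor pairs."""
    causal = set(immediate)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(causal):
            for (c, d) in list(causal):
                if b == c and (a, d) not in causal:
                    causal.add((a, d))
                    changed = True
    return causal


def concurrent(causal, a, b):
    """Two events are concurrent when neither causally precedes the other."""
    return (a, b) not in causal and (b, a) not in causal


def is_valid_observation(record_order, causal):
    """An observation is valid if every causally ordered pair is recorded in that order."""
    position = {event: i for i, event in enumerate(record_order)}
    return all(position[a] < position[b] for (a, b) in causal)


# Hypothetical execution: e1 -> e2 -> e4 are causally chained; e3 is independent.
immediate = {("e1", "e2"), ("e2", "e4")}
causal = causal_closure(immediate)
```

In this example e1 is concurrent with e3 and e3 is concurrent with e2, yet e1 causally precedes e2, which shows why concurrence is not transitive.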

[0435] B. Event-Order Tracking

[0436] Debugging typically requires an understanding of the order in which events occur. Above, observers were presented as separate processes. While that treatment simplified the discussion of observers, it is typically not a practical implementation of an observer. When the observer is implemented as a physical process, the signals to indicate events would have to be transformed into physical messages and the system would have to be synchronized to enable all messages to arrive in a valid order.

[0437] FIG. 30 depicts a space/time graph 3000 with each event having a label 3002. With reference to FIG. 30, DLO 2902 can accurately place event records in their proper bins, even if received out of order, as long as it knows the bins of the immediate predecessors. If we know the bins in which events are recorded, we can determine something about their causality. Fortunately, it is easy to label each event with the number of its intended bin. Labels 3002 are analogous to time and are typically called Lamport timestamps.

[0438] A Lamport timestamp is an integer t associated with an event e_(i) such that, with e_(i) ⇝ e_(j) denoting that e_(i) causally precedes e_(j):

e_(i) ⇝ e_(j) ⇒ t(e_(i)) < t(e_(j))

[0439] Lamport timestamps can be assigned as needed, provided the labels of an event's immediate predecessors are known. This information can be maintained with a local counter, called a Lamport clock (not shown), t_(Pi), on each process, P_(i). The clock's value is transmitted with each message M_(j) as t_(Mj). Clock value t_(Pi) is updated with each event, e, as follows: $t_{P_i} = \begin{cases} \max\left(t_{M_j}, t_{P_i}\right) + 1 & \text{if } e \text{ is a receive event} \\ t_{P_i} + 1 & \text{otherwise} \end{cases}$
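The clock update rule can be sketched as a small class. The class and method names are hypothetical, and starting each clock at 0 is an assumption; the update itself follows the rule given above.

```python
class LamportProcess:
    """A process that maintains a scalar Lamport clock."""

    def __init__(self, name):
        self.name = name
        self.clock = 0    # assumed initial clock value

    def local_event(self):
        """Any non-receive event: advance the clock by one and return the event's timestamp."""
        self.clock += 1
        return self.clock

    def send(self):
        """A send event is a non-receive event; its timestamp travels with the message."""
        return self.local_event()

    def receive(self, message_timestamp):
        """A receive event: take the larger of the two values, then advance by one."""
        self.clock = max(message_timestamp, self.clock) + 1
        return self.clock


p, q, r = LamportProcess("P"), LamportProcess("Q"), LamportProcess("R")
t1 = p.send()          # event e1 on P
t2 = q.receive(t1)     # event e2 on Q, caused by e1
t3 = r.local_event()   # event e3 on R, independent of e1 and e2
```

Here t1 is 1 and t2 is 2, as the causal order requires. The independent event e3 also receives timestamp 1, which is equal to t1 and smaller than t2, so the scalar values alone do not reveal that e3 is unrelated to both.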

[0440] A labeling mechanism is said to characterize the causal relation if, based on their labels alone, it can be determined whether two events are causally related or concurrent. Although Lamport timestamps are consistent with causality (if t(e_(i)) ≧ t(e_(j)), then e_(i) cannot causally precede e_(j)), they do not characterize the causal relation.

[0441] FIG. 31 is a space/time graph 3100 that demonstrates the inability of scalar timestamps to characterize causality between events. With reference to FIG. 31, space/time graph 3100 shows event e₁ 3102, event e₂ 3104, and event e₃ 3106. e₁ 3102 causes e₂ 3104, e₁ 3102 is concurrent with e₃ 3106, and e₂ 3104 is concurrent with e₃ 3106. For scalar timestamps to characterize these relations, e₃ 3106 would have to appear concurrent with both e₁ 3102 and e₂ 3104. However, since e₁ 3102 causally precedes e₂ 3104, the scalar timestamps of e₁ 3102 and e₂ 3104 differ, and it is not possible for the single timestamp of e₃ 3106 to appear concurrent with both.

[0442] Event causality can be tracked completely using explicit event dependence graphs, with directed edges from each event to its immediate predecessors. Unfortunately, this method cannot store enough information with each record to determine whether two arbitrarily chosen events are causally related without traversing the dependence graph.

[0443] Other labeling techniques, such as vector timestamps, can characterize causality. The typical formulation of vector timestamps is based on the cardinality of event histories. A basis for vector timestamps is established based on the following definitions and theorems. An event history, H(e_(j)), of an event e_(j) is the set of all events, e_(i), such that either e_(i) causally precedes e_(j) (e_(i) ⇝ e_(j)) or e_(i) = e_(j). The event history can be projected against specific processes. For a process P_(j), the P_(j) history projection of H(e_(i)), H_(Pj)(e_(i)), is the intersection of H(e_(i)) and the set of events local to P_(j). The event graph represented by a space/time diagram can be partitioned into equivalence classes, with one class for each process. The set of events local to P_(j) is just the P_(j) equivalence class.

[0444] The intersection of any two projections from the same process is identical to at least one of the two projections. Two history projections from a single process, Hp(a) and Hp(b), must satisfy one of the following:

[0445] a) Hp(a)⊂Hp(b)

[0446] b) Hp(a)=Hp(b)

[0447] c) Hp(a)⊃Hp(b)

[0448] The cardinality of H_(Pj)(e_(i)) is thus the number of events local to P_(j) that causally precede e_(i), plus e_(i) itself when e_(i) is local to P_(j). Since local events always occur in sequence, we can uniquely identify an event by its process and the cardinality of its local history.

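[0449] For events e_(a), e_(b) with e_(a) ≠ e_(b), where P_(ea) is the process on which e_(a) occurs and e_(a) ⇝ e_(b) denotes that e_(a) causally precedes e_(b):

H_(Pea)(e_(a)) ⊆ H_(Pea)(e_(b)) ⇔ e_(a) ⇝ e_(b)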

[0450] FIG. 32 shows a space/time diagram 3200 with vector timestamped events. A vector timestamp 3202 is a vector label, t_(e), assigned to each event, e∈E, such that the i^(th) element represents |H_(Pi)(e)|. Given two distinct events, e_(i) and e_(j), we can determine their causal ordering: if vector t_(ei) has a value for its own process's entry that is no larger than the value the other, t_(ej), has at that same position, then e_(i) causally precedes e_(j) (e_(i) ⇝ e_(j)). If both vectors have larger values for their own process entries, then e_(i)∥e_(j). It is not possible for both events to have values no larger than the other's for their own entries because, for events e_(a) and e_(b), e_(a) ⇝ e_(b) implies H_(Pea)(e_(a)) ⊆ H_(Pea)(e_(b)), and the causal relation cannot hold in both directions. It is not necessary to know the local processes of events to determine their causal order using vector timestamps.

[0451] The causal order of two distinct vector timestamped events, e_(a) and e_(b), from unknown processes can be determined with an element-by-element comparison of their vector timestamps, where ⇝ denotes the causal relation and ∥ denotes concurrence: $\bigwedge_{i=1}^{n} \left( t_{e_a}[i] \leq t_{e_b}[i] \right) \Rightarrow e_a \rightsquigarrow e_b$ $\neg \bigwedge_{i=1}^{n} \left( t_{e_a}[i] \leq t_{e_b}[i] \right) \wedge \neg \bigwedge_{i=1}^{n} \left( t_{e_b}[i] \leq t_{e_a}[i] \right) \Rightarrow e_a \parallel e_b$
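The element-by-element comparison can be sketched as follows. The function names, the returned strings, and the sample timestamps are assumptions chosen for illustration.

```python
def all_leq(ta, tb):
    """True when every element of ta is less than or equal to the matching element of tb."""
    return all(x <= y for x, y in zip(ta, tb))


def compare(ta, tb):
    """Classify two vector timestamps without knowing which processes produced them."""
    if ta == tb:
        return "same event"
    if all_leq(ta, tb):
        return "a precedes b"
    if all_leq(tb, ta):
        return "b precedes a"
    return "concurrent"


# Hypothetical timestamps in a three-process system:
t_e1 = [1, 0, 0]    # first event on process 0 (a send)
t_e2 = [1, 1, 0]    # first event on process 1 (the matching receive)
t_e3 = [0, 0, 1]    # first event on process 2 (independent)

results = [compare(t_e1, t_e2), compare(t_e1, t_e3), compare(t_e2, t_e3)]
```

Unlike scalar timestamps, the vectors distinguish the causally ordered pair from the two concurrent pairs using the labels alone.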

[0452] Thus vector timestamps both fully characterize causality and uniquely identify each event in an execution.

[0453] Computing vector timestamps at runtime is similar to Lamport timestamp computation. Each process (P_(s)) contains a vector clock (t̂_(Ps)) with elements for every process in the system, where t̂_(Ps)[s] always equals the number of events local to P_(s). Snapshots of this vector counter are used to label each event, and snapshots are transmitted with each message. The recipient of a message with a vector snapshot can update its own vector counter (t̂_(Pr)) by replacing it with sup(t̂_(Ps), t̂_(Pr)), the element-wise maximum of t̂_(Ps) and t̂_(Pr).
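A vector clock of this kind can be sketched as a small class. The class and method names are hypothetical, and clocks are assumed to start at all zeros; counting the receive itself as a local event keeps a process's own entry equal to its number of local events.

```python
class VectorProcess:
    """A process that maintains a vector clock with one element per process."""

    def __init__(self, index, num_processes):
        self.index = index
        self.clock = [0] * num_processes    # assumed initial value

    def _record_event(self):
        """Count one more local event and return a snapshot used to label it."""
        self.clock[self.index] += 1
        return list(self.clock)

    def local_event(self):
        return self._record_event()

    def send(self):
        """The snapshot labeling the send event is transmitted with the message."""
        return self._record_event()

    def receive(self, snapshot):
        """Replace the clock with the element-wise maximum, then count the receive event."""
        self.clock = [max(mine, theirs) for mine, theirs in zip(self.clock, snapshot)]
        return self._record_event()


p0, p1, p2 = (VectorProcess(i, 3) for i in range(3))
label_send = p0.send()
label_receive = p1.receive(label_send)
label_other = p2.local_event()
```

The receive event is labeled [1, 1, 0], reflecting both the one event it has seen from process 0 and its own first local event.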

[0454] This technique places enough information with each message todetermine message ordering. It is performed by comparing snapshotsattached to each message. However, transmission of entire snapshots isusually not practical, especially if the system contains a large numberof processes.

[0455] Vector clocks can however be maintained without transmittingcomplete snapshots. A transmitting process, P_(s), can send a list thatincludes only those vector clock values that have changed since its lastmessage. A recipient, P_(r), then compares the change list to itscurrent elements and updates those that are smaller. This requires eachprocess to maintain several vectors: one for itself and one for eachprocess to which it has sent messages. However, change lists do notcontain enough information to independently track message order.

[0456] The expense of maintaining vector clocks can be a strong deterrent to employing them. Unfortunately, no technique with smaller labels can characterize causality. It has been shown that the dimension of the causal relation for an N-process distributed execution is N, and hence N-element vectors are the smallest labels characterizing causality.

[0457] The problem results from concurrence, without which Lamport time would be sufficient. Concurrence can be tracked with concurrency maps, where each event keeps track of all events with which it is concurrent. Since the maps characterize concurrency, adding Lamport time lets them also characterize causality (the concurrency information disambiguates the scalar time). Unfortunately, concurrency maps can only be constructed after-the-fact, since doing so requires an examination of events from all processes.

[0458] In some situations, distinguishing between concurrency and causality is not a necessity, but merely a convenience. There are compact labeling techniques that allow better concurrence detection than Lamport time. One such technique uses interval clocks, in which each event record is labeled with its own Lamport time and the Lamport time of its earliest successor. This label then represents a Lamport time interval, during which the corresponding event was the latest known by the process. This gives each event a wider region with which to detect concurrence (indicated by overlapping intervals).

[0459] In cases in which there is little or no cross-process causality (few messages), interval timestamps are not much better than Lamport timestamps. In cases with large numbers of messages, however, interval timestamps can yield better results.

[0460] C. Space/Time Displays in Debugging Tools

[0461] Space/time diagrams have typically proven useful in discussing event causality and concurrence. Space/time diagrams are also often employed as the user display in concurrent program debugging tools.

[0462] The Los Alamos parallel debugging system uses a text-based time-process display, and Idd uses a graphic display. Both of these, however, rely on an accurate global real-time clock (impractical in most systems).

[0463] FIG. 33 shows a Partial Order Event Tracer (POET) display 3300. The POET system supports several different languages and run-time environments, including Hermes, a high-level interpreted language for distributed systems, and Java. With reference to FIG. 33, POET display 3300 distinguishes among several types of events by shapes, shading, and alignment of corresponding message lines.

[0464] A Distributed Program Debugger (DPD) is based on a Remote Execution Manager (REM) framework. The REM framework is a set of servers on interconnected Unix machines in which each server is a Unix user-level process. Processes executing in this framework can create and communicate with processes elsewhere in the network as if they were all on the same machine. DPD uses space/time displays for debugging communication only, and it relies on separate source-level debuggers for individual processes.

[0465] 2. Abstraction in Event-Based Debugging

[0466] Simple space/time displays can be used to present programmers with a wealth of information about distributed executions. Typically, however, space/time diagrams are too abstract to be an ultimate debugging solution. Space/time diagrams show high-level events and message traffic, but they do not support designer interaction with the source code. On the other hand, simple space/time diagrams may sometimes have too much detail. Space/time diagrams display each distinct low-level message that contributes to a high-level transaction without support for abstracting the transaction.

[0467] FIG. 34 is a space/time diagram 3400 having a first compound event 3402 and a second compound event 3404. With reference to FIG. 34, even though a pair of primitive events are either causally related or concurrent, first and second compound events 3402 and 3404, or any other pair of compound events, might be neither causally related nor concurrent. Abstraction is typically applied across two dimensions (events and processes) to aid in the task of debugging distributed software. Event abstraction represents sequences of events as single entities. A group of events may occasionally have a specific semantic meaning that is difficult to recognize, much as streams of characters can have a meaning that is difficult to interpret without proper spacing and punctuation. Event abstraction can in some circumstances complicate the relationships between events.

[0468] Event abstraction can be applied in one of three ways: filtering, clustering, and interpretation. With event filtering, a programmer describes event types that the debugger should ignore, which are then hidden from view. With clustering, the debugger collects a number of events and presents the group as a single event. With interpretation, the debugger parses the event stream for event sequences with specific semantic meaning and presents them to a programmer.

[0469] Process abstraction is usually applied only as hierarchical clustering. The remainder of this section discusses these specific event and process abstraction approaches.

[0470] A. Event Filtering and Clustering

[0471] Event filtering and clustering are techniques used to hide events from a designer and thereby reduce clutter. Event filters exclude selected events from being tracked in event-based debugging techniques. In most cases, this filtering is implicit and cannot be modified without changing the source code because the source code being debugged is designed to report only certain events to the debugger. When deployed, the code will report all such events to the tool. This approach is employed in both DPD and POET, although some events may be filtered from the display at a later time.

[0472] An event cluster is a group of events represented as a single event. The placement of an event in a cluster is based on simple parameters, such as virtual time bounds and process groups. Event clusters can have causal ambiguities. For example, one cluster may contain events that causally precede events in a second cluster, while other events causally follow certain events in the second cluster.

[0473] FIG. 35 shows a POET display 3500 involving a first convex event cluster 3502 and a second convex event cluster 3504. POET uses a virtual-time-based clustering technique that represents convex event clusters as single abstract events. A convex event cluster is a set of event instances, C, such that for events a, b, c ∈ E with a, c ∈ C: $a \rightarrow b \wedge b \rightarrow c \Rightarrow b \in C$

[0474] Convex event clusters, unlike generic event clusters, cannot overlap.
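The convexity condition can be checked directly from its definition. The sketch below is illustrative only: `is_convex` is a hypothetical helper, and `precedes` is assumed to be a caller-supplied function that returns True when its first argument causally precedes its second (transitively).

```python
from itertools import product


def is_convex(cluster, events, precedes) -> bool:
    """Return True if no event outside the cluster lies causally between two members."""
    outside = [e for e in events if e not in cluster]
    # A violation is an outside event b with a -> b and b -> c for members a and c.
    return not any(precedes(a, b) and precedes(b, c)
                   for a, c in product(cluster, repeat=2)
                   for b in outside)
```

For a totally ordered chain of events 1, 2, 3, the cluster {1, 2} is convex, while {1, 3} is not because event 2 falls between its members.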

[0475] B. Event Interpretation (Specific Background for Behavioral Abstraction)

[0476] The third technique for applying event abstraction is interpretation, also referred to as behavioral abstraction. Both terms describe techniques that use debugging tools to interpret the behavior represented by sequences of events and present results to a designer. Most approaches to behavioral abstraction let a designer describe sequences of events using expressions, and the tools recognize the sequence of events through a combination of customized finite automata followed by explicit checks. Typically, matched expressions generate new events.

[0477] 1. Event Description Language (EDL)

[0478] One of the earliest behavioral abstraction techniques was event description language (EDL), in which event streams are pattern-matched using shuffle automata. A match produces a new event that can, in turn, be part of another pattern. Essentially, abstract events are hierarchical and are built from the bottom up.

[0479] This approach can recognize event patterns that contain concurrent events. There are, however, several weaknesses in this approach. First, shuffle automata match events from a linear stream, which is subject to a strong observational bias. In addition, even if the stream constitutes a valid observation, interleaving may cause false intermediates between an event and its immediate successor. Finally, concurrent events appear to occur in some specific order.

[0480] Bates partially compensates for these problems in three ways. First, all intermediates between two recognized events are ignored; hence, false intermediates are skipped. Unfortunately, true intermediates are also skipped, making error detection difficult. Second, the shuffle operator, Δ, is used to identify matches with concurrent events. Unfortunately, shuffle recognizes events that occur in any order, regardless of whether they are truly ordered in the corresponding execution. For example, e₁Δe₂ can match either e₁ followed by e₂ or e₂ followed by e₁ in the event stream, which means the actual matches could be e₁ → e₂ or e₂ → e₁, in addition to the e₁∥e₂ that the programmer intended to match. Third, the programmer can prescribe explicit checks to be performed on each match before asserting the results. However, the checks allowed do not include causality or concurrence checks.

[0481] 2. Chain Expressions

[0482] Chain expressions, used in the Ariadne parallel debugger, are an alternate way to describe distributed behavior patterns that have both causality and concurrence. These behavioral descriptions are based on chains of events (abstract sequences not bound to processes), p-chains (chains bound to processes), and pt-chains (composed p-chains). The syntax for describing chain expressions is fairly simple, with <a b> representing two causally related events and |[a b]| representing two concurrent events.

[0483] The recognition algorithm has two functions. First, the algorithm recognizes the appropriate event sequence from a linear stream, using a nondeterministic finite automaton (NFA). Second, the algorithm checks the relationships between specific events.

[0484] For example, when looking for sequences that match the expression <|[a b]| c> (viz., a and b are concurrent, and both causally precede c), Ariadne will find the sequence a b c and then verify the relationships among them. Unfortunately, the fact that sequences are picked in order from a linear stream before relationships are checked can cause certain matches to be missed. For example, |[a b]| and |[b a]| should have the same meaning, but they do not cause identical matches. This is because Ariadne uses NFAs as the first stage in event abstraction. In the totally ordered stream to which an NFA responds, either a will precede b, preventing the NFA for the second expression from recognizing the string, or b will precede a, preventing the NFA for the first expression from recognizing the string.

[0485] 3. Distributed Abstraction

[0486] The behavioral abstraction techniques described so far rely on centralized abstraction facilities. These facilities can be distributed, as well. The BEE (Basis for distributed Event Environments) project is a distributed, hierarchical, event-collection system, with debugging clients located with each process.

[0487] FIG. 36 shows a Basis for distributed Event Environments (BEE) abstraction facility 3600 for a single client. With reference to FIG. 36, event interpretation is performed at several levels. The first is an event sensor 3602, inserted into the source of the program under test and invoked whenever a primitive event occurs during execution. The next level is an event generator 3604, where information, including timestamps and process identifiers, is attached to each event. Event generator 3604 uses an event table 3606 to determine whether events should be passed to an event handler 3608 or simply dropped. Event handler 3608 manages event table 3606 within event generator 3604. Event handler 3608 filters and collects events and routes them to appropriate event interpreters (not shown). Event interpreters (not shown) gather events from a number of clients (not shown) and aggregate them for presentation to a programmer. Clients and their related event interpreters are placed together in groups managed by an event manager (not shown). A weakness of this technique is that it does not specifically track causality. Instead, this technique relies on the real-time timestamps attached to specific primitive or abstract events. However, as discussed above, these timestamps are not able to characterize causality.

[0488] C. Process Clustering

[0489] Most distributed computing environments feature flat process structures, with few formally stated relationships among processes. Automatic process clustering tools can partially reverse-engineer a hierarchical structure to help remove spurious information from a debugger's view. Intuitively, a good cluster hierarchy should reveal, at the top level, high-level system behavior, and the resolution should improve proportionally with the number of processes exposed. A poor cluster hierarchy would show very little at the top level and would require a programmer to descend several hierarchical levels before getting even a rough idea about system behavior. Process clustering tools attempt to identify common interaction patterns, such as client-server, master-slave, complex server, layered system, and so forth. When these patterns are identified, the participants are clustered together. Clusters can then serve as participants in interaction patterns to be further clustered. Typically, these cluster hierarchies are strictly trees, as shown in FIG. 37, which depicts a hierarchical construction of process clusters 3700. With reference to FIG. 37, a square node 3702 represents a process (not shown) and a round node 3704 represents a process cluster (not shown).

[0490] Programmers can choose a debugging focus, in which they specify the aspects and detail levels they want to use to observe an execution. With reference to FIG. 37, a representative debugging focus that includes nodes I, J, E, F, G, and H is shown. One drawback of this approach is that when a parent cluster is in focus, none of its children can be. For example, if we wanted to look at process K in detail, we would also need to expose at least as much detail for processes E and L and process cluster D.

[0491] Each process usually participates in many types of interactions with other processes. Therefore, the abstraction tools must heuristically decide between several options. These decisions have a substantial impact on the quality of a cluster hierarchy. Prior art systems have evaluated the quality of a clustering tool by measuring the cohesion within a cluster (the higher the better) and the coupling between clusters, a measure of the information clusters must know about each other (the higher the worse). Both are expressed quantitatively but are actually qualitative measurements. For a cluster P of m processes, cohesion is quantified by: $\mathrm{Cohesion}(P) = \frac{\sum\limits_{i < j} \mathrm{Sim}_{f}\left( p_{i}, p_{j} \right)}{m\left( m - 1 \right)/2}$

[0492] where Sim_(f)(P₁, P₂) is a similarity metric that equals: $\mathrm{Sim}_{f}\left( P_{1}, P_{2} \right) = \frac{A\,\langle \hat{C}_{P_{1}} | \hat{C}_{P_{2}} \rangle}{\| \hat{C}_{P_{1}} \| \cdot \| \hat{C}_{P_{2}} \|}$

[0493] Here, $\langle \hat{a} | \hat{b} \rangle$ denotes the scalar product of vectors $\hat{a}$ and $\hat{b}$, and $\| \hat{a} \|$ denotes the magnitude of vector $\hat{a}$. $\hat{C}_{P_{1}}$ and $\hat{C}_{P_{2}}$ are process characteristic vectors; in them, each element contains a value between 0 and 1 that indicates how strongly a particular characteristic manifests itself in each process. Characteristics can include keywords, type names, function references, etc. A is a value that equals 1 if any of the following apply:

[0494] P₁ and P₂ are instantiations of the same source.

[0495] P₁ and P₂ are unique instantiations of their own source.

[0496] P₁ and P₂ communicate with each other.

[0497] A equals 0 if none of these is true (e.g., P₁ and P₂ are nonunique instantiations of separate source that do not communicate with each other). Coupling is quantified by: $\mathrm{Coupling}(P) = \frac{\sum\limits_{i,j} \mathrm{Sim}_{f}\left( p_{i}, q_{j} \right)}{mn}$

[0498] where q_(j)∈Q, Q is the complement of P, and n=|Q|. The quality of a cluster is quantified as its Cohesion minus its Coupling. In many cases, these metrics match many of the characteristics that intuitively differentiate good and poor clusters, as shown in FIGS. 38A, B, and C. With reference to FIGS. 38A and C, Cohesion is high where clusters correspond to heavy communication and where clusters correspond to processes instantiated from the same source code. Coupling is shown to be low in each of the above cases. With reference to FIG. 38B, Coupling is high when clusters do not correspond to heavily communicating processes or to instances of the same source code. It is not clear, however, that the cluster in FIG. 38C should be assigned the same quality value as the cluster in FIG. 38A. Using these metrics, Kunz achieved qualities of between 0.15 and 0.31 for his clustering techniques. However, it is hard to tell what this means in terms of cluster usefulness.
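The cohesion and coupling metrics can be computed as in the following sketch. The function names are hypothetical, each process is represented only by its characteristic vector, and the factor A is assumed to be 1 for every pair unless the caller passes another value; these are simplifications for illustration.

```python
import math
from itertools import combinations


def sim(c1, c2, a: float = 1.0) -> float:
    """Similarity: A times the scalar product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(c1, c2))
    norm = math.sqrt(sum(x * x for x in c1)) * math.sqrt(sum(y * y for y in c2))
    return a * dot / norm if norm else 0.0


def cohesion(cluster) -> float:
    """Average similarity over the m(m-1)/2 pairs of processes inside the cluster."""
    m = len(cluster)
    if m < 2:
        raise ValueError("cohesion needs at least two processes")
    return sum(sim(p, q) for p, q in combinations(cluster, 2)) / (m * (m - 1) / 2)


def coupling(cluster, rest) -> float:
    """Average similarity between cluster members and processes outside the cluster."""
    if not cluster or not rest:
        raise ValueError("both the cluster and its complement must be non-empty")
    return sum(sim(p, q) for p in cluster for q in rest) / (len(cluster) * len(rest))
```

Two processes with identical characteristic vectors give a cohesion of 1, and a cluster whose members share no characteristics with the remaining processes has a coupling of 0.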

[0499] 3. State-Based Debugging

[0500] State-based debugging techniques focus on the state of the system and the state changes caused by events, rather than on events themselves. The familiar source-level debugger for sequential program debugging is state-based. This source-level debugger lets designers set breakpoints in the execution of a program, enabling them to investigate the state left by the execution to that point. This source-level debugger also lets programmers step through a program's execution and view changes in state caused by each step.

[0501] Concurrent systems have no unique meaning for an instant in execution time. Stopping or single-stepping the whole system can unintentionally, but substantially, change the nature of interactions between processes.

[0502] A. Consistent Cuts and Global State

[0503] In distributed event-based debugging, the concept of causality is typically of such importance that little of value can be discussed without a firm understanding of causality and its implications. In distributed state-based debugging, the concept of a global instant in time is equally important.

[0504] Here again, it may seem intuitive to consider real-time instants as the global instants of interest. However, just as determining the real-time order of events is not practical or even particularly useful, finding accurate real-time instants makes little sense. Instead, a global instant is represented by a consistent cut. A consistent cut is a cut of an event dependency graph representing an execution that (a) intersects each process exactly once and (b) points all dependencies crossing the cut in the same direction. Like real-time instants, consistent cuts have both a past and a future. These are the subgraphs on each side of the cut.

[0505] FIG. 39 shows that consistent cuts can be represented as a jagged line across the space/time diagram that meets the above requirements. With reference to FIG. 39, a space/time graph 3900 is shown having a first cut 3902 and a second cut 3904. All events to the left of either first cut 3902 or second cut 3904 are in the past of each cut, and all events to the right are in the future of each cut, respectively. First cut 3902 is a consistent cut because no message travels from the future to the past. Second cut 3904, however, is not consistent because a message 3906 travels from the future to the past.
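The consistency test can be sketched as follows. The representation is an assumption made for illustration: a cut maps each process to the index of its last event in the past, and each message is a tuple (sender, send index, receiver, receive index); `is_consistent` is a hypothetical name.

```python
def is_consistent(cut, messages) -> bool:
    """Return True if no message is received in the cut's past but sent in its future."""
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_past = recv_idx <= cut[receiver]
        sent_in_future = send_idx > cut[sender]
        if received_in_past and sent_in_future:
            return False   # the message travels from the future to the past
    return True
```

With one message sent as event 2 of process 0 and received as event 1 of process 1, the cut {0: 2, 1: 1} is consistent, while the cut {0: 1, 1: 1} is not.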

[0506] FIGS. 40A, B, and C show that a distributed execution shown in a space/time diagram 4000 can be represented by a lattice of consistent cuts 4002, in which ⊤ is the start of the execution and ⊥ is system termination. With reference to FIGS. 40A, B, and C, lattice of consistent cuts 4002 represents the global statespace traversed by a single execution. Because the size of lattice of consistent cuts 4002 is on the order of |E|^(|P|), it, unlike space/time diagrams, is never actually constructed.

[0507] In the remainder of this chapter, properties of consistent cut lattices are described using two relations between cuts: one relates cuts such that one immediately precedes the other, and the other relates cuts between which there is a path.

[0508] B. Single Stepping in a Distributed Environment

[0509] Controlled stepping, or single-stepping, through regions of an execution can help with an analysis of system behavior. The programmer can examine changes in state at the completion of each step to get a better understanding of system control flow. Coherent single-stepping for a distributed system requires steps to align with a path through a normal execution's consistent cut lattice.

[0510] DPD works with standard single-process debuggers (called client debuggers), such as DBX, GDB, etc. Programmers can use these tools to set source-level break-points and single-step through individual process executions. However, doing so leaves the other processes executing during each step, which can yield unrealistic executions.

[0511] Zernik gives a simple procedure for single-stepping using a post-mortem traversal of a consistent cut lattice. At each point in the step process, there are two disjoint sets of events: the past set, or events that have already been encountered by the stepping tool, and the future set, or those that have yet to be encountered. To perform a step, the debugger chooses an event, e_(i), from the future such that any events it depends on are already in the past, i.e., there are no future events, e_(f), such that e_(f) → e_(i). This ensures that the step proceeds between two consistent cuts, one of which immediately precedes the other.

[0512] The debugger moves this single event to the past, performing any necessary actions.
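A single step of this kind can be sketched as follows. The names `step` and `depends_on` are hypothetical; `depends_on` is assumed to map each event to the set of events it directly depends on, and ties between enabled events are broken by sorted order purely to keep the sketch deterministic.

```python
def step(past: set, future: set, depends_on: dict):
    """Move one enabled event from the future set to the past set and return it."""
    for event in sorted(future):
        # An event is enabled when everything it depends on is already in the past.
        if depends_on.get(event, set()) <= past:
            future.remove(event)
            past.add(event)
            return event
    return None   # nothing left to step, or no enabled event
```

Repeated calls walk one path through the consistent cut lattice, one event at a time.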

[0513] To allow more types of steps, POET's support for single-stepping uses three disjoint sets: executed, ready, and nonready. The executed set is identical to the past set in “Using Visualization Tools to Understand Concurrency,” by D. Zernik, M. Snir, and D. Malki, IEEE Software 9, 3 (1992), pp. 87-92. The ready set contains all events that are fully enabled by events in the executed set, and the contents of the nonready set have some enabling events in either the ready or nonready sets. Using these sets, it is possible to perform three different types of steps: global-step, step-over, and step-in.

[0514] Global-step and step-over may progress between two consistent cuts that are not immediately adjacent (i.e., there may be several intermediate cuts between the step cuts).

[0515] A global-step is performed by moving all events from the ready set into the executed set. Afterwards, the debugger must move to the ready set all events in the nonready set whose dependencies are in the executed set. A global-step is useful when the programmer wants information about a system execution without having to look at any process in detail.
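A global-step over the three sets can be sketched as follows. The function name and the `depends_on` mapping (event to the set of events that enable it) are assumptions made for illustration.

```python
def global_step(executed: set, ready: set, nonready: set, depends_on: dict) -> None:
    """Execute every ready event, then promote newly enabled nonready events."""
    executed.update(ready)
    ready.clear()
    # A nonready event becomes ready once all of its enabling events are executed.
    newly_ready = {e for e in nonready if depends_on.get(e, set()) <= executed}
    nonready.difference_update(newly_ready)
    ready.update(newly_ready)
```

Each call advances every process that has an enabled event, so the cut may move past several intermediate cuts at once.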

[0516] The step-over procedure considers a local, or single-process, projection of the ready and nonready sets. To perform a step, it moves the earliest event from the local projections into the executed set and executes through events on the other processes until the next event in the projection is ready. This ensures that the process in focus will always have an event ready to execute in the step that follows.

[0517] Step-in is another type of local step. Unlike step-over, step-in does not advance the system at the completion of the step; instead, the system advance is considered to be a second step. FIGS. 41A, B, C, and D show a space/time diagram before a step 4100 and a resulting space/time diagram after performing a global-step 4102, a step-over 4104, and a step-in 4106.

[0518] C. Runtime Consistent Cut Algorithms

[0519] It is occasionally necessary to capture consistent cuts at runtime. To do so, each process performs some type of cut action (e.g., state saving). This can be done with barrier synchronization, which erects a temporal barrier that no process can pass until all processes arrive. Any cut taken immediately before, or immediately after, the barrier is consistent. However, with barrier synchronization, some processes may have a long wait before the final process arrives.

[0520] A more proactive technique is to use a process called the cut initiator to send perform-cut messages to all other system processes. Upon receiving a perform-cut message, a process performs its cut action, sends a cut-finished message to the initiator, and then suspends itself. After the cut initiator receives cut-finished messages from all processes, it sends each of them a message to resume computation.

[0521] The cut obtained by this algorithm is consistent: no process is allowed to send any messages from the time it performs its own cut action until all processes have completed the cut. This means that no post-cut messages can be received by processes that have yet to perform their own cut action. This algorithm has the undesirable characteristic of stopping the system for the duration of the cut. The following algorithms differ in that they allow some processing to continue.

[0522] 1. Chandy-Lamport Algorithm

[0523] The Chandy-Lamport algorithm does not require the system to be stopped. Once again, the cut starts when a cut initiator sends perform-cut messages to all of the processes. When a process receives a perform-cut message, it stops all work, performs its cut action, and then sends a mark on each of its outgoing channels; a mark is a special message that tells its recipient to perform a cut action before reading the next message from the channel. When all marks have been sent, the process is free to continue computation. If the recipient has already performed the cut action when it receives a mark, it can continue working as normal.

[0524] Each cut request and each mark associated with a particular cut are labeled with a cut identifier, such as the process ID of the cut initiator and an integer. This lets a process distinguish between marks for cuts it has already performed and marks for cuts it has yet to perform.
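The mark-passing behavior can be sketched with in-memory FIFO queues standing in for channels. This is a simplified illustration under stated assumptions: the class `Process`, the string `"MARK"`, and the integer state are hypothetical, only a single cut is handled (so no cut identifier is carried), and channel contents are not recorded.

```python
from collections import deque


class Process:
    """Process with an integer state and FIFO outgoing channels (illustrative)."""

    def __init__(self, name: str):
        self.name = name
        self.state = 0
        self.snapshot = None   # state saved by the cut action
        self.out = {}          # neighbor name -> deque used as a FIFO channel

    def cut(self) -> None:
        """Perform the cut action once, then send a mark on every outgoing channel."""
        if self.snapshot is None:
            self.snapshot = self.state
            for channel in self.out.values():
                channel.append("MARK")

    def deliver(self, message) -> None:
        """Handle one message; a mark triggers the cut before later messages are read."""
        if message == "MARK":
            self.cut()         # no effect if the cut was already performed
        else:
            self.state += message
```

Because the channel is FIFO, a message sent after the sender's cut always arrives behind the mark, so it cannot affect the recipient's saved state.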

[0525] 2. Color-Based Algorithms

[0526] The Chandy-Lamport algorithm works only for FIFO (First In First Out) channels. If a channel is non-FIFO, a post-cut message may outrun the mark and be inconsistently received before the recipient is even aware of the cut, i.e., it is received in the cut's past. The remedy to this situation is a color-based algorithm. Two such algorithms are discussed below.

[0527] The first is called the two-color, or red-white, algorithm. With this algorithm, information about the cut state is transferred with each message. Each process in the system has a color. Processes not currently involved in a consistent cut are white, and all messages transmitted are given a white tag. Again, there is a cut initiator that sends perform-cut messages to all system processes. When a process receives this request, it halts, performs the cut action, and changes its color to red. From this point on, all messages transmitted are tagged with red to inform the recipients that a cut has occurred.

[0528] Any process can accept a white message without consequence, but when a white process receives a red message, it must perform its cut action before accepting the message. Essentially, white processes treat red messages as cut requests. Red processes can accept red messages at any time, without consequence.
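The red-white rules can be sketched as follows. The class `TwoColorProcess`, the integer state, and the (tag, value) message format are assumptions chosen for illustration; resetting processes to white after the cut is not modeled.

```python
WHITE, RED = "white", "red"


class TwoColorProcess:
    """Process that tags outgoing messages with its current color (illustrative)."""

    def __init__(self):
        self.color = WHITE
        self.state = 0
        self.snapshot = None

    def perform_cut(self) -> None:
        """Save state and turn red; does nothing if the cut was already taken."""
        if self.color == WHITE:
            self.snapshot = self.state
            self.color = RED

    def send(self, value: int) -> tuple:
        return (self.color, value)

    def receive(self, message: tuple) -> None:
        tag, value = message
        if tag == RED:
            self.perform_cut()   # a white process treats a red message as a cut request
        self.state += value
```

The tag makes the outcome independent of delivery order, which is why the scheme tolerates non-FIFO channels.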

[0529] A disadvantage of the two-color algorithm is that the system must reset all of the processes back to white after they have completed their cut action. After switching back, each process must treat red messages as if they were white until they are all flushed from the previous cut. After this, each process knows that the next red message it receives signals the next consistent cut.

[0530] This problem is addressed by the three-color algorithm, which resembles the two-color algorithm in that every process changes color after performing a cut; it differs in that every change in color represents a cut. For colors zero through two, if a process with the color c receives a message with the color (c−1) mod 3, it registers this as a message-in-flight (see below). On the other hand, if it receives a message with the color (c+1) mod 3, it must perform its cut action and switch color to (c+1) mod 3 before receiving the message. Of course, this can now be generalized to n-color algorithms, but three colors are usually sufficient.
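The color comparison can be sketched as a small pure function. The name `on_receive` and the returned action labels are hypothetical; the function only classifies a message and does not perform the cut action itself.

```python
def on_receive(color: int, msg_color: int, n: int = 3) -> tuple:
    """Return (new_color, action) for a process of `color` receiving `msg_color`."""
    if msg_color == (color + 1) % n:
        # The sender has already taken the next cut: cut first, then accept.
        return (color + 1) % n, "cut"
    if msg_color == (color - 1) % n:
        # The message was sent before the sender's latest cut.
        return color, "in-flight"
    return color, "normal"
```

For example, a process with color 2 that receives a message with color 0 performs its cut and switches to color 0.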

[0531] Programmers may need to know about messages transmitted across the cut, or messages-in-flight. In the two-color algorithm, messages-in-flight are simply white messages received by red processes. These can all be recorded locally, or the recipient can report them to the cut initiator. In the latter case, each red process simply sends the initiator a record of any white messages received.

[0532] It is not safe to switch from red to white in the two-color algorithm until the last message-in-flight has been received. This can be detected by associating a counter with each process. A process increments its counter for each message sent and decrements it for each message received. When the value of this counter is sent to the initiator at the start of each process's cut action, the initiator can use the total value to determine the total number of messages-in-flight. The initiator simply decrements this count for each message-in-flight notification it receives.
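The counter bookkeeping can be sketched as follows. The class and method names are hypothetical, and message transport is left out; only the arithmetic that tells the initiator when the last message-in-flight has arrived is shown.

```python
class CountingProcess:
    """Tracks messages sent minus messages received before the cut (illustrative)."""

    def __init__(self):
        self.counter = 0

    def on_send(self) -> None:
        self.counter += 1

    def on_receive(self) -> None:
        self.counter -= 1


class Initiator:
    """Sums the reported counters and counts down as in-flight messages arrive."""

    def __init__(self):
        self.in_flight = 0

    def report_counter(self, value: int) -> None:
        self.in_flight += value          # sent at the start of each cut action

    def report_in_flight_message(self) -> None:
        self.in_flight -= 1              # a red process received a white message

    def safe_to_reset(self) -> bool:
        return self.in_flight == 0
```

The sum of all counters at the cut equals the number of messages sent but not yet received, so the total reaches zero exactly when every message-in-flight has been reported.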

[0533] D. State Recovery—Rollback and Replay

[0534] Since distributed executions tend to be nondeterministic, it is often difficult to reproduce bugs that occur during individual executions. To do so, most distributed debuggers contain a rollback facility that returns the system to a previous state. For this to be feasible, all processes in the system must occasionally save their state. This is called checkpointing the system. Checkpoints do not have to save the entire state of the system. It is sufficient to save only the changes since the last checkpoint. However, such incremental checkpointing can prolong recovery.

[0535] DPD makes use of the UNIX fork system call to perform checkpointing for later rollback. When fork is called, it makes an exact copy of the calling process, including all current states. In the DPD checkpoint facility, the newly forked process is suspended and indexed. Rollback suspends the active process and resumes an indexed process. The problem with this approach is that it can quickly consume all system memory, especially if checkpointing occurs too frequently. DPD's solution is to let the programmer choose the checkpoint frequency through use of a slider in its GUI.

[0536] Processes must sometimes be returned to states that were not specifically saved. In this case, the debugger must do additional work to advance the system to the desired point. This is typically called replay and is performed using event trace information to guide an execution of the system. In replay, the debugger chooses an enabled process (i.e., one whose next event has no pending causal requirements) and executes it, using the event trace to determine where the process needs to block for a message that may have arrived asynchronously in the original execution. When the process blocks, the debugger chooses the next enabled process and continues from there. In this way, a replay is causally identical to the original execution.

[0537] Checkpoints must be used in a way that prevents domino effects. The domino effect occurs when rollbacks force processes to restore more than one state. Domino effects can roll the system back to the starting point. FIG. 42 shows a space/time diagram 4200 for a system that is subject to the domino effect during rollback. With reference to FIG. 42, if the system requests a rollback to checkpoint c₃ 4202 of process P₃ 4204, all processes in the system must roll back to c₁ (i.e., roll back to P₃.c₂ 4206 requires a roll back to P₂.c₂ 4208, which requires a roll back to P₁.c₂ 4210, which requires a roll back to P₃.c₁ 4212, which requires a roll back to P₂.c₁ 4214, which requires a final roll back to P₁.c₁ 4216). The problem is caused by causal overlaps between message transfers and checkpoints. Performing checkpoints only at consistent cuts avoids a domino effect.

[0538] E. Global State Predicates

[0539] The ability to detect the truth value of predicates on global state yields much leverage when debugging distributed systems. This technique lets programmers raise flags when global assertions fail, set global breakpoints, and monitor interesting aspects of an execution. Global predicates are those whose truth value depends on the state maintained by several processes. They are typically denoted with the symbol Φ. Some examples include (Σ_(i)c_(i)>20) and (c₁<20 ∧ c₂>5), where c_(i) is some variable in process P_(i) that stores positive integers. In the worst case (such as when (Σ_(i)c_(i)>20) is false for an entire execution), it may be necessary to get the value of all such variables in all consistent cuts. In the following discussion, we use the notation C_(a)|=Φ to indicate that Φ is true in consistent cut C_(a).

[0540] At this point, it is useful to introduce branching time temporal logic. Branching time temporal logic is predicate logic with temporal quantifiers P, F, G, H, A, and E. PΦ is true in the present if Φ was true at some point in the past; FΦ is true in the present if Φ will be true at some point in the future; GΦ is true in the present if Φ will be true at every moment in the future; and HΦ is true in the present if Φ was true at every moment of the past. Notice that GΦ is the same as ¬F¬Φ, and HΦ is the same as ¬P¬Φ.

[0541] Since global time passage in distributed systems is marked by apartially ordered consistent cut lattice rather than by a totallyordered stream, we need the quantifiers A, which precedes a predicatethat is true on all paths, and E, which precedes a predicate that istrue on at least one path. So, AFΦ is true in the consistent cutrepresenting the present if Φ is true at least once on all paths in thelattice leaving this cut. EPΦ is true in the consistent cut representingthe present if Φ is true on at least one path leading to this cut.

[0542] A monotonic global predicate is a predicate Φ such that C_(a)|=Φ ⇒ C_(a)|=AGΦ. A monotonic global predicate is one that remains true after becoming true. An unstable global predicate, on the other hand, is a predicate Φ such that C_(a)|=Φ ⇒ C_(a)|=EG¬Φ. An unstable global predicate is one that may become false after becoming true.

[0543] 1. Detecting Monotonic Global Predicates

[0544] Monotonic predicates can be detected any time after becomingtrue. One algorithm is to occasionally take consistent cuts and evaluatethe predicate at each. In fact, it is not necessary to use consistentcuts, since any transverse cut whose future is a subset of the future ofthe consistent cut in which the predicate first became true will alsoshow the predicate true.

[0545] 2. Detecting Unstable Global Predicates

[0546] Detecting arbitrary unstable global predicates can take at worst |E|^(|P|) time, where |E|^(|P|) is the size of an execution's consistent cut lattice, |E| is the number of events in the execution, and |P| is the number of processes. This is so because it may be necessary to test for the predicate in every possible consistent cut. However, there are a few special circumstances that allow |E| time algorithms.

[0547] Some unstable global predicates are true on only a few paths through the consistent cut lattice, while others are true on all paths. The prior art describes predicate qualifiers definitely Φ for predicates that are true on all paths (i.e., |=AFΦ) and possibly Φ for those that are true on at least one path (i.e., |=EFΦ).

[0548] The detection of possibly Φ for weak conjunctive predicates, or global predicates that can be expressed as conjunctions of local predicates, is O(|E|). The algorithm for this is to walk a path through the consistent cut lattice that aligns with a single process, P_(t), until either (1) the process's component of Φ is true or (2) there is no way to proceed without diverging from P_(t). In either case, the target process is switched and the walk continued. This algorithm continues until it reaches a state in which all components of the predicate are true or until it reaches ⊥. In this way, if there are any consistent cuts where all parts of the predicate simultaneously hold, the algorithm will encounter at least one.

[0549] Detection of possibly Φ for weak disjunctive predicates, or global predicates that can be expressed as disjunctions of local predicates, is also O(|E|); it is the same algorithm as above, except it halts at the first node where any component is true. However, weak conjunctive and disjunctive predicates constitute only a small portion of the types of predicates that could be useful in debugging distributed systems.

[0550] 4. Conclusions

[0551] Complicating the debugging of heterogeneous embedded systems are designs composed of concurrent and distributed processes. Most of the difficulty in debugging distributed systems results from concurrent processes with globally unscheduled and frequently asynchronous interactions. Multiple executions of a system can produce wildly varying results, even if they are based on identical inputs. The two main debugging approaches for these systems are event based and state based.

[0552] Event-based approaches are monitoring approaches. Events arepresented to a designer in partially ordered event displays, calledspace/time displays. These are particularly good at showinginter-process communication over time. They can provide a designer withlarge amounts of information in a relatively small amount of space.

[0553] State-based approaches focus locally on the state of individualprocesses or globally on the state of the system. Designers can observeindividual system states, set watches for specific global predicates,step through executions, and set breakpoints based on global statepredicates. These approaches deal largely with snapshots, consideringtemporal aspects only as differences between snapshots.

[0554] As distributed systems increase in size and complexity, the sheervolume of events generated during an execution grows to a point where itis exceedingly difficult for designers to correctly identify aspects ofthe execution that may be relevant in locating a bug. For distributedsystem debugging techniques to scale to larger and faster systems,behavioral abstraction will typically become a necessity to helpdesigners identify and interpret complicated behavioral sequences in asystem execution. Finally, embedded systems must execute in a separateenvironment from the one in which they were designed and embeddedsystems may also run for long periods of time without clear stoppingpoints. Debugging them requires probes to report debugging informationto a designer during the execution. These probes inevitably alter systembehavior, which can mask existing bugs or create new bugs that are notpresent in the uninstrumented system. While it is not possible tocompletely avoid these probe effects, they can be minimized throughcareful placement, or masked through permanent placement.

[0555] Static Control Graphs

[0556] A static control graph (SCG) is a graph-theoretic representationof all pure control constraints. FIG. 43 shows a simple SCG. It is abi-partite digraph, having two types of nodes: conjunctive nodes 4300,which, as the name implies, produce results only when all incident edges4302 are satisfied, and disjunctive nodes 4304, 4306, and 4308, whichproduce results if any incident edge 4302 is satisfied. Disjunctivenodes 4304, 4306, and 4308 correspond to modes 102 in components 100 andcoordinators 410 (as previously shown in FIG. 1 and FIG. 4) throughoutthe system. An SCG for a complete system simultaneously represents allcontrol constraints.

[0557] An SCG is a triple, G=(C, D, E), in which:

[0558] C is a set of conjunctive nodes 4300.

[0559] D is a set of disjunctive nodes 4304, 4306, and 4308.

[0560] E ⊆ [{T_(f), T_(t)}×{H_(f), H_(t)}×((C×D)∪(D×C))] is a set of directed, labeled edges 4302. Edges are sensitive to either a false value or a true value at their tail 4310 (T_(f) or T_(t)) and enforce either a false value or a true value at their head 4312 (H_(f) or H_(t)). These are represented visually by a bubble at the appropriate end for a false value or the lack of a bubble for a true value.

[0561] An edge 4302 in an SCG can be either enabled or disabled; it produces the value true or the value false. FIG. 44 illustrates a graphic notation for edge labels. Edges 4400 and 4402 marked with a bubble on head 4312, as in FIG. 44A and FIG. 44B, assert the value false when activated. When there is no mark on head 4312, as in FIG. 44C and FIG. 44D, edges 4404 and 4406 assert the value true when activated. A bubble on tail 4310, as in FIG. 44B and FIG. 44D, indicates that edges 4402 and 4406 are sensitive to false on the node they exit. The lack of such bubbles, as in FIG. 44A and FIG. 44C, indicates that edges 4400 and 4404 are sensitive to true.

[0562] Referring back to FIG. 43, the figure shows a simple SCG in which a node d 4306 must be active whenever a conjunction (a ∧ b ∧ c) 4314 is active and inactive whenever a node e 4308 is active. Although this looks similar to a Boolean network, it differs because the SCG edges represent implication, not connection. This is illustrated in FIG. 45, which shows a Boolean network OR node 4500; when all inputs and outputs are negated, it is equivalent to an AND node 4502 (by De Morgan's laws). A disjunctive SCG node with all inputs and outputs negated 4504 is equivalent to a disjunctive node with no inputs and outputs negated 4506.

[0563] Each SCG has a Boolean characteristic function. This is a Boolean function that is true for each configuration in which no constraints are violated and that is false for configurations with violated constraints. FIG. 46 shows two SCGs along with their reduced characteristic functions, in which the functions (i.e., a ∧ b ⇒ c) are reduced to functions using only basic operators (i.e., ¬(a ∧ b ∧ ¬c)). FIG. 46A shows conjunction without negation, whereas FIG. 46B shows conjunction with negation. Characteristic functions for SCGs with several conjunctive nodes are the conjunction of all constraints and, as such, may not be satisfiable.
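The relationship between a constraint and its reduced characteristic function can be illustrated with a short truth-table check. The sketch below assumes a single constraint in which the conjunction of a and b enforces c; the function names are illustrative only and do not appear in the figures.

```python
from itertools import product


def implication_form(a: bool, b: bool, c: bool) -> bool:
    """Constraint read as 'whenever a and b are active, c must be active'."""
    return c if (a and b) else True


def reduced_form(a: bool, b: bool, c: bool) -> bool:
    """The same constraint written with only AND and NOT."""
    return not (a and b and not c)


def violated_configurations():
    """List every (a, b, c) assignment for which the characteristic function is false."""
    return [cfg for cfg in product([False, True], repeat=3) if not reduced_form(*cfg)]


# Both forms agree on all eight configurations.
assert all(implication_form(*cfg) == reduced_form(*cfg)
           for cfg in product([False, True], repeat=3))
print(violated_configurations())  # [(True, True, False)]
```

Only the configuration with a and b active and c inactive violates the constraint, which is the single configuration the characteristic function rejects.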

[0564] FIG. 47 shows the impact of edge semantics on an SCG. Edges that are incident upon conjunctive nodes (i.e., edges that take the form (d, c) for d∈D, c∈C) are called sensing edges 4700 and 4702; edges incident upon disjunctive nodes (i.e., edges that take the form (c, d)) are called enforcing edges 4704 and 4706. All edges and nodes are labeled with their respective source object (i.e., a mode/control port combination for a disjunctive node, and a constraint for a conjunctive node). When a bubble is placed at the head of an enforcing edge 4708, as in FIG. 47A, it has different semantics than when the bubble is placed at the tail of an enforcing edge 4710, as in FIG. 47B; at the head of a sensing edge 4712, as in FIG. 47C; or at the tail of a sensing edge 4714, as in FIG. 47D. However, the latter three have identical semantics.

[0565] Activation influences for conjunctive nodes are always apparent in an SCG: they are the disjunctive nodes that appear at the tail side of incident edges. As a result, conjunctive nodes need never be labeled. Disjunctive nodes are frequently mapped to modes in the system; hence, they may have hidden activation influences. In these circumstances, node labels must be applied to indicate the modes to which they are mapped. It may not always be possible for all edges to assert their value. Edges that are prevented from doing so are said to be violated.

[0566] 1. Instability and Dynamic Properties

[0567] Although SCGs embody static relationships between modes, throughunstable configurations (viz., configurations that are temporarilyinvalid and must hence attempt resolution), they can also model somedynamic properties. Recall the behavior of a rendezvous coordinator4800; it contains interfaces 4802, 4804, 4806, and 4808 for two types ofcomponents: resource users 4810 and resources 4812. Rendezvouscoordinator 4800 lets resource users 4810 enter a waiting mode 4814, andresources 4812 enter an available mode 4816. When possible, rendezvouscoordinator 4800 releases a waiting component 4818 and an availablecomponent 4820 together. This can be modeled by the SCG shown in FIG.48B. The SCG has three conjunctive nodes 4822, 4824, and 4826 that eachsense (1) whether a particular wait node 4828, 4830, or 4832 and anavail node 4834 are simultaneously active and (2) whether there are anyactive wait nodes 4828, 4830, and 4832 with precedence. When one ofconjunctive nodes 4822, 4824, or 4826 is satisfied, it releases itsrespective wait node 4828, 4830, or 4832 and avail node 4834—causingconjunctive node 4822, 4824, or 4826 to cease being satisfied.

[0568] An important property of an SCG is the any stable state property(ASSP). ASSP of SCGs states that there is at least one stableconfiguration. An SCG is said to have this property if it has at leastone configuration in which all constraints are simultaneously satisfied.A graph without the ASSP is shown in FIG. 49. Trying to enforce allconstraints results in an inactive node b 4900 forcing a node c 4902 tobe inactive, which in turn forces node b 4900 to be active, which forcesnode c 4902 to be active—forcing node b 4900 to be inactive andrestarting the cycle.
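A brute-force check makes the property concrete. The sketch below is a minimal illustration, not the disclosed method: it assumes each constraint is given as a Python predicate over a configuration dictionary, and it encodes the four implications described for nodes b and c as a hypothetical stand-in for the graph of FIG. 49. It enumerates every configuration, so its cost grows exponentially with the number of nodes.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    return (not p) or q


def stable_configurations(nodes, constraints):
    """Return every assignment of the nodes that satisfies all constraints at once."""
    stable = []
    for values in product([False, True], repeat=len(nodes)):
        cfg = dict(zip(nodes, values))
        if all(constraint(cfg) for constraint in constraints):
            stable.append(cfg)
    return stable


def has_assp(nodes, constraints) -> bool:
    """An SCG has the ASSP if at least one stable configuration exists."""
    return bool(stable_configurations(nodes, constraints))


# Assumed encoding of the cycle described above: inactive b forces c inactive,
# inactive c forces b active, active b forces c active, active c forces b inactive.
cyclic_constraints = [
    lambda s: implies(not s["b"], not s["c"]),
    lambda s: implies(not s["c"], s["b"]),
    lambda s: implies(s["b"], s["c"]),
    lambda s: implies(s["c"], not s["b"]),
]

print(has_assp(["b", "c"], cyclic_constraints))  # False
```

Removing any one of the four constraints leaves at least one stable configuration, which is why the instability only appears when the full cycle is present.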

[0569] It can be shown that determining whether an SCG has the ASSP is NP-Complete, meaning that there are probably no general, efficient algorithms for determining whether an SCG has this property. The 3-SAT problem, a well-known NP-Complete problem commonly used to prove by polynomial time reduction that other computational problems are NP-Hard, can be reduced to ASSP in polynomial time. FIG. 50A shows the SCG for a 3-SAT problem. FIG. 50B shows the 3-SAT problem. FIG. 50C shows the representative characteristic functions for the 3-SAT problem.

[0570] Two enforcing edges can conflict and possibly cause raceconditions in certain configurations. To avert this, edge labels alsocontain a priority to indicate which will be enforced if there is aconflict. Usually, these priorities are supplied by the coordinatorsthat define the constraint and hence the enforcing edge. Althoughfinding a complete constraint system solution is NP-Hard, it is possibleto present a solution consistent with a designer's expectations in lessthan exponential time.

[0571] 2. Petri-net Similarities

[0572] SCGs are similar to Petri-nets in several significant ways. Petri-nets are also bi-partite digraphs with two node types: transitions and places. A system state is represented by a marking of places, where each place can be either marked by a token or not marked at all. A transition fires if there is a token on each of the places on the opposite side of incoming edges. When a transition fires, it consumes all enabling tokens and places tokens on each of the places on the opposite side of outgoing edges.
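The firing rule just described can be sketched in a few lines. This is a minimal illustration that assumes a marking is a dictionary from place name to a Boolean (token present or not); the function and place names are hypothetical.

```python
def enabled(marking, input_places):
    """A transition is enabled when every input place holds a token."""
    return all(marking.get(place, False) for place in input_places)


def fire(marking, input_places, output_places):
    """Return the marking after firing; an unenabled transition leaves it unchanged."""
    if not enabled(marking, input_places):
        return dict(marking)
    new_marking = dict(marking)
    for place in input_places:
        new_marking[place] = False   # consume the enabling tokens
    for place in output_places:
        new_marking[place] = True    # deposit tokens on the output places
    return new_marking


before = {"p1": True, "p2": True, "p3": False}
print(fire(before, ["p1", "p2"], ["p3"]))
# {'p1': False, 'p2': False, 'p3': True}
```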

[0573] The main similarity between Petri-nets and SCGs is theconjunctive and disjunctive behaviors of transitions and places. The wayin which a place in a Petri-net becomes marked is similar to the way inwhich a disjunctive node in an SCG is changed—namely, incident edgesmanipulate the state directly. The main difference is that, withPetri-nets, the only way a place can become unmarked is if a transitionon an outgoing edge removes the token. With SCGs, the only conjunctivenodes that can cause changes in state are those on the incoming edges.

[0574] 3. Construction of SCGs

[0575] SCG construction uses Boolean constraints from a standardproduct-of-sums. A standard form Boolean constraint is a tuple (I, O, R,A), in which:

[0576] I is a set of input literals.

[0577] O is a set of output literals.

[0578] R is a set of disjunctions on input values, i.e., R ⊆ 2^(I).

[0579] A is a set of output literals matched with conjunctions on valuesin R and values in C

O×2^(R).

[0580] Algorithm 1 shows how an SCG is constructed from a set of constraints in a system.

Algorithm 1 - SCG construction
Require: M, a set of modes, and S, a set of boolean constraints in standard form.
create G = (C, D, E), a new control graph
for all m ∈ M do
    add new d_(m) to D // add a disjunctive node
end for
for all s ∈ S do
    for all a ∈ s.A do
        add new c_(sa) to C // add a conjunctive node
        for all o ∈ a.O do
            add new edge (c_(sa), o) to E
        end for
        for all i ∈ s.I do
            add new edge (i, c_(sa)) to E
        end for
    end for
end for

[0581] Algorithm 1 takes a set of modes and constraints, and for eachmode generates a new disjunctive node, and for each constraint adds anew conjunctive node. For each left argument of the constraint, thealgorithm adds a new edge from the disjunctive node that corresponds tothe argument to the conjunctive node. For each right argument of theconstraint, the algorithm adds a new edge from the conjunctive node tothe disjunctive node that corresponds to the argument.
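The construction can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 under simplifying assumptions: edge sensitivity and enforcement labels are omitted, modes are plain strings, and the `Constraint`, `Assertion`, and `SCG` classes are hypothetical containers introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    name: str
    outputs: tuple      # modes enforced by this assertion (a.O)


@dataclass(frozen=True)
class Constraint:
    name: str
    inputs: tuple       # modes sensed by the constraint (s.I)
    assertions: tuple   # assertions made by the constraint (s.A)


@dataclass
class SCG:
    conjunctive: set = field(default_factory=set)
    disjunctive: set = field(default_factory=set)
    edges: set = field(default_factory=set)


def build_scg(modes, constraints):
    graph = SCG()
    for mode in modes:
        graph.disjunctive.add(mode)               # one disjunctive node per mode
    for s in constraints:
        for a in s.assertions:
            c = f"c_{s.name}_{a.name}"
            graph.conjunctive.add(c)              # one conjunctive node per assertion
            for o in a.outputs:
                graph.edges.add((c, o))           # enforcing edge
            for i in s.inputs:
                graph.edges.add((i, c))           # sensing edge
    return graph


example = build_scg(
    ["a", "b", "c", "d"],
    [Constraint("s1", ("a", "b", "c"), (Assertion("x", ("d",)),))],
)
print(sorted(example.edges))
```

The example builds a graph in which one conjunctive node senses a, b, and c and enforces d, loosely mirroring the shape described for FIG. 43.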

[0582] 4. Constraint Conflict Detection

[0583] The most important use of SCGs is in finding constraint enforcement conflicts. Constraint enforcement conflicts occur when two active constraints try to force a disjunctive node in opposite directions. Since constraints are prioritized, any conflicts can be resolved at runtime in favor of the constraint with the highest priority. However, resolution in this fashion can cause unexpected, undesired behavior (i.e., bugs) in the embedded system. Preemptive debugging, in the form of constraint conflict detection, aids designers in catching such problems.

[0584] There are two different types of constraint conflicts: firstorder and n^(th) order. First order conflicts are fairly straightforwardto detect, but detecting n^(th) order conflicts is, in general, as hardas detecting ASSP. A number of techniques can be used to reduce thecomplexity of n^(th) order conflict detection. The first techniqueconservatively approximates conjunctive nodes as disjunctive nodes. Thistechnique is polynomial and detects all possible conflicts, but it mayalso deliver a number of false positives. The second technique exploitsthe hierarchical properties of systems designed usingcoordination-centric modeling to cache partial results; for certain SCGscalled well-composed, this technique can deliver results in polynomialtime. Many subgraphs in typical SCGs are replicated over and over. Inthese cases, replication can be exploited to deliver results inpolynomial time. Finally, subgraphs may not relate to others, and inthese cases, it is possible to analyze them separately.

[0585] A. First Order Conflicts

[0586] First order conflicts occur when two potentially coactive enforcing edges have opposite effects on the same disjunctive node. FIG. 51A shows one such conflict: if nodes a, b, c, and d 5100, 5102, 5104, and 5106, respectively, are all simultaneously active, a conflict 5108 occurs between c ∧ d 5110 and a ∧ b 5112. Since the conflicting edges 5114 and 5116 are prioritized, it is easy for the runtime system to dynamically resolve this in favor of c ∧ d 5110.

[0587] First order conflicts are easy to detect—for each disjunctivenode, simply find all enforcing edges that conflict in sense and traceback to their respective conjunctive nodes.
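The detection step can be sketched as a grouping pass over enforcing edges. The sketch assumes each enforcing edge is a hypothetical `(conjunctive_node, disjunctive_node, enforced_value)` tuple; it reports candidate pairs only, and whether the two source conjunctions can actually be coactive would still need to be checked.

```python
from collections import defaultdict
from itertools import combinations


def first_order_conflicts(enforcing_edges):
    """Return (target, source_1, source_2) for edges that enforce opposite values."""
    by_target = defaultdict(list)
    for source, target, value in enforcing_edges:
        by_target[target].append((source, value))

    conflicts = []
    for target, incoming in by_target.items():
        for (src1, val1), (src2, val2) in combinations(incoming, 2):
            if val1 != val2:                     # opposite sense on the same node
                conflicts.append((target, src1, src2))
    return conflicts


edges = [
    ("c_and_d", "e", True),    # c AND d activates e
    ("a_and_b", "e", False),   # a AND b deactivates e
    ("a_and_b", "f", True),
]
print(first_order_conflicts(edges))  # [('e', 'c_and_d', 'a_and_b')]
```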

[0588] B. n^(th) Order Conflicts

[0589] FIG. 51B shows a potential simple second order conflict between the terms d ∧ g ∧ h 5118 and a ∧ b 5120. In this case, priority labels 5122, 5124, and 5126 of the edges cause the d ∧ g ∧ h 5118 term to be ignored whenever there is a conflict between them. Cyclic n^(th) order conflicts are particularly bad, because they can cause instability in the underlying system, even with priority assignments.

[0590] Finding and eliminating n^(th) order conflicts is another NP-Hardproblem, so we are reduced to practical techniques that seem to performwell given the characteristics of common SCGs. Experience has shown thatthese graphs frequently embody exponential control state spaces;therefore, it is essential to avoid enumerating the entire control statespace in conflict detection.

[0591] Algorithm 2 shows one conflict detection algorithm that canusually attain reasonable performance. This algorithm produces acomplete closure of the SCG, such that conflicts are identified byconflicts in conjunctive node output.

[0592] The conjunction between terms is represented by a single conjunctive node, and a potential conflict is realized if and only if all intermediate disjunctive nodes are satisfied. In practice, SCGs and the systems from which they are derived often have characteristics that can be exploited to find these potential conflicts efficiently.

Algorithm 2 - Flattening static control graphs
Require: (D, C, E) is a copy of a system's static control graph
for all d_(i) ∈ D, (c_(j), d_(i)) and (d_(i), c_(k)) ∈ E do
    if consistent((c_(j), d_(i)), (d_(i), c_(k))) then
        if absent(c_(j), c_(k)) then
            add new c_(jk) to C
            for all (d_(m), c_(j)) ∈ E do
                add new (d_(m), c_(jk)) to E
            end for
            for all (d_(n), c_(k)) ∈ E with d_(n) ≠ d_(i) do
                add new (d_(n), c_(jk)) to E
            end for
            for all (c_(k), d_(p)) ∈ E do
                add new (c_(jk), d_(p)) to E
            end for
        end if
    end if
end for

[0593] Algorithm 2 finds all disjunctive nodes and, for each of them, takes all consistent edges and creates a new conjunctive node for each. For each of the new conjunctive nodes, the algorithm creates new edges (1) for each edge on the conjunctive node, on the tail side of the left edge and (2) for each edge on the conjunctive node, on the head side of the right edge. The consistency check ensures that no literals in one conjunctive node conflict with literals in the other (e.g., it will return false if c_(j) contains a and c_(k) contains ¬a). The absence check ensures that the new node will not be redundant.

[0594] FIG. 52A shows a portion of an SCG before application of Algorithm 2 (flattening), and FIG. 52B shows the SCG after flattening. A new constraint between end nodes 5200, 5202, 5204, and 5206 is constructed; it includes C₁ 5210 and edges 5212, 5214, and 5216 entering and leaving C₁ 5210. This constraint represents the cascading effect through a node d_(i) 5208.

[0595] To find an upper bound on the space and time required, assumethat each possible consistent, distinct, conjunctive node will becreated and that each of these will fan out to all disjunctive nodes.This means that for n disjunctive nodes, space for 3^(n) conjunctivenodes and 2n3^(n) edges (at most n edges on both sides of eachconjunctive node) is required. Assuming that a vast majority of theseare created during the execution of the algorithm, time must beapproaching 4n3^(n). These space and time requirements are extremelysensitive to the initial fan-in and fan-out of nodes in the graph, theinitial number of conjunctive nodes, and the length of acyclic,self-consistent paths (i.e., paths along which a single change effectcan be propagated).

[0596] Once the graph has been flattened, true conflicts can be ascertained by tracing back pairs of edges that assert different values on any disjunctive node and by determining whether their source conjunctive nodes are mutually exclusive. An example is shown in FIG. 53. Outgoing edges 5300 and 5302 from C_(l) 5304 and C_(k) 5306 form a potential conflict 5308. To determine whether a conflict exists, the literals that make up the respective conjunctions are compared, treating unused variables as “don't cares.” As Table 6 shows, conflict 5308 occurs for the configuration d_(i) ∧ d_(j) ∧ ¬d_(k) ∧ d_(l).

TABLE 6 Finding the conflict from FIG. 53.

              d_(i)  d_(j)  d_(k)  d_(l)
C_(l)           1      1      0      -
C_(k)           -      -      0      1
equivalent      1      1      1      1
conflict        1      1      0      1
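The comparison summarized in Table 6 can be sketched as a merge of two partial assignments. The sketch assumes each conjunction is a dictionary from variable name to required value (1 or 0), with don't-care variables simply absent; the names `c_l` and `c_k` stand in for the two source conjunctive nodes.

```python
def merge_conjunctions(left, right):
    """Merge two conjunctions; return None if they are mutually exclusive."""
    merged = dict(left)
    for variable, value in right.items():
        if variable in merged and merged[variable] != value:
            return None          # the conjunctions can never be active together
        merged[variable] = value
    return merged


c_l = {"d_i": 1, "d_j": 1, "d_k": 0}   # d_l is a don't-care
c_k = {"d_k": 0, "d_l": 1}             # d_i and d_j are don't-cares

print(merge_conjunctions(c_l, c_k))
# {'d_i': 1, 'd_j': 1, 'd_k': 0, 'd_l': 1}
```

A non-None result is a configuration in which both conjunctions are active, so a pair of opposing enforcing edges leaving them is a true conflict; a None result means the potential conflict can be discarded.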

[0597] Algorithm 2 highlights instability, which shows up as simplecycles. For example, the instability in FIG. 49 is visible in FIG. 54 asfour simple cycles 5400, 5402, 5404, and 5406 after flattening.

[0598] There are several techniques available to improve the performanceof Algorithm 2 for a wide variety of SCGs. The following subsectionsdescribe some of them.

[0599] 1. Disjunctive Graph Approximation

[0600] A special case of SCGs is one in which each conjunctive node has only a single input. In this case, Algorithm 2 becomes a variety of Warshall's algorithm, in which conjunctive nodes just define relationships between disjunctive nodes. If there are only a few conjunctive nodes with more than one incident edge, each of them can be replaced with several conjunctive nodes with single incident edges.

[0601] This approach yields a conservative approximation of theconflicting edges, with the possibility of a large number of falsepositives. If the number of false positives that concern the designer issmall, they can be individually verified against the original graph.
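For the single-input special case, the closure over disjunctive nodes can be sketched with Warshall's algorithm. The sketch assumes every single-input conjunctive node has already been collapsed into a direct `(source, target)` pair between disjunctive nodes; edge labels and priorities are ignored, so it illustrates reachability only.

```python
def transitive_closure(nodes, edges):
    """Warshall's algorithm: compute all pairs (u, v) where v is reachable from u."""
    reach = set(edges)
    for k in nodes:              # allow k as an intermediate node
        for i in nodes:
            for j in nodes:
                if (i, k) in reach and (k, j) in reach:
                    reach.add((i, j))
    return reach


closure = transitive_closure(["a", "b", "c"], [("a", "b"), ("b", "c")])
print(sorted(closure))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

The three nested loops give cubic time in the number of disjunctive nodes, which is what makes this approximation polynomial.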

[0602] 2. Hierarchical Reduction

[0603] Recall that one element of a component's coordination interfaceis a set of guarantees. Guarantees are summaries of component propertiesthat are already verified. These guarantees can include summaries ofinternal relationships between control ports on the interface. FIG. 55Ashows one such summary. Although the actual relationship between x 5500and

y 5502 involves a number of internal nodes 5504, 5506, 5508, and 5510,as seen in FIG. 55, none of them are part of the interface; therefore,they can be summarized as a single, independent node 5512, marked with“?”, as shown in FIG. 55B.

[0604] Since our methodology encourages hierarchical composition, stable state consistency can often be applied hierarchically to these graphs. In this approach, each component provides a summary of the relationships between all interface modes. A system that is well-composed (as shown in FIG. 56) in this fashion will have the number of nodes visible in each scope less than some reasonable constant c, and the ratio of the number of non-interface nodes within a scope to the number of interface nodes is always greater than some constant factor d, where d>1. With this, there are (n log_(d) n)/c different scopes to analyze, and the time required for each scope

[0605] is then less than or equal to 4c3^(c); thus the time required for an entire well-composed graph is: $t \leq \frac{4c\,3^{c}\,n\log_{d}n}{c}$

[0606] which can be stated as t≤Pn lg n, where P is the constant $\frac{4\left( 3^{c} \right)}{\lg d}.$

[0607] This means that hierarchical, stable-state consistency forwell-composed graphs is O (n lg n).
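The steps behind this bound can be written out explicitly. The derivation below assumes that lg denotes the base-2 logarithm and uses the change-of-base identity for logarithms; it multiplies the number of scopes by the per-scope time bound.

```latex
\begin{align*}
t &\leq \underbrace{\frac{n \log_{d} n}{c}}_{\text{number of scopes}}
      \cdot \underbrace{4c\,3^{c}}_{\text{time per scope}}
   = 4\,(3^{c})\, n \log_{d} n \\
  &= 4\,(3^{c})\, n \,\frac{\lg n}{\lg d}
   = P\, n \lg n,
   \qquad P = \frac{4\,(3^{c})}{\lg d}.
\end{align*}
```

Because c and d are constants for a well-composed system, P is a constant, and the bound is O(n lg n).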

[0608] While it is not possible to force arbitrary systems to fit thisconstruct, it is possible to determine whether a system is well-composedbefore attempting this technique. Furthermore, for components suppliedwith models, the system can still use the interaction summary;therefore, it is not always necessary for components to be well-composedinternally. Since the summary can be cached with components, the costfor preparing it can be amortized over many attempts at analysis.

[0609] 3. Exploiting Replication

[0610] There are many protocols that have a large number of connectedcomponents, but how the components interact is independent of theirnumber. In protocols such as token ring or subsumption, in which theinterface is replicated with predictable relationships betweeninstances, it may not be necessary to verify all possible conflictsbetween all components. For example, with subsumption, componentsinteract cleanly with the protocol, and conflicts between components areminimal. The conflicts between a component and a protocol can be treatedas independent from the other components plugged into a protocol, so thecomponent can be simply checked against the protocol.

[0611] 4. Orthogonal Reduction

[0612] Often, interactions between interfaces in a layered system arethrough opaque actions. Therefore, each layer can be considered aseparate entity, and the graph for each layer can be constructedindependently of the other graphs. In these cases, flattening need notconsider the system as a whole, but merely the subgraphs for eachparticular layer.

[0613] Dynamic Control Graphs

[0614] A dynamic control graph (DCG) includes pure control actions aswell as pure control constraints. DCGs can be defined as a triple(Γ_(d), D_(d), E_(e)) wherein:

[0615] Γ_(d) is a set of conjunctive and action nodes, i.e., Γ_(d) ⊆ (C∪A),

[0616] where C is the universe of all possible conjunctive nodes, and Ais the universe of all possible actions.

[0617] D_(d) is a set of disjunctive nodes.

[0618] E_(e) ⊆ [{T_(f), T_(t)}×{H_(f), H_(t)}×{T, N}×((Γ_(d)×D_(d))∪(D_(d)×Γ_(d)))] is a set of directed edges that are labeled either Transient (T) or Continuous (N).

[0619]FIG. 57 depicts a DCG 5700. With reference to FIG. 57, a purecontrol action (not shown) is a transparent action that is triggeredonly by a control transient and that produces only a control transient.In DCG 5700, the pure control action is represented as a conjunctivenode, an action node 5702, with outgoing dashed edges 5704 and 5706, andincoming dashed edge 5708. A pure control constraint (not shown) is onethat constrains only modes (not shown). As described above, an actioncan be one of two types: an instantaneous action or a delayed action.The instantaneous action type is executed immediately when its triggeris received, and the delayed action type is executed at some point intime after its trigger is received. The delay of an action is embodiedin its action node 5702.

[0620]FIG. 58 depicts a DCG 5800 with an action node 5802. Withreference to FIG. 58, action node 5802 has no apparent trigger. Withthis, DCG 5800 fully characterizes all control aspects of the action(although the data aspect, which in this case is the trigger, is notincluded in any control graph). FIG. 59 shows a DCG 5900 for an action(not shown) that is transparent with respect to control interactions.FIG. 60 shows a DCG 6000 for an action (not shown) that is opaque withrespect to control; the graph shows a conservative approximation of theaction's control behavior.

[0621] As mentioned earlier, it is better to model coordinatortransitions through explicit actions rather than through static controlgraph (SCG) instability. FIGS. 61A and 61B are two DCGs 6100 and 6102,for rendezvous coordinator 900, of FIG. 9, and a rendezvous coordinatorwith two-participant preemption, respectively. With reference to FIGS.61A and 61B, there are n control actions (not shown) (one for eachcomponent (not shown) with a wait coordination interface (not shown)).Each of the n control actions is represented in DCGs 6100 and 6102 by anaction node 6104, and each action node is guarded by the conjunction ofall wait modes 6106 with lower precedence being inactive. When an actionis enabled and triggered, it deactivates its respective wait mode 6106and an avail mode 6108.

[0622] DCGs expose hidden interactions between coordinators. As shown inFIG. 61B, DCG 6102 reveals that an interaction between wait_(b) andwait_(c), of wait modes 6106, occurs outside of the rendezvouscoordinator with two-participant preemption. The interaction is shown asa combination of edges 6110 and 6112, and preempt action node 6114. Asshown, the interaction might cause wait_(c) to intercept wait_(b)'s waitfor the resource.

[0623] A. Partitioning SCGs and DCGs Across Action-Only Barriers

[0624] A useful static graph transformation that can be performed on both SCGs and DCGs is partitioning across action-only barriers. An action-only barrier is a barrier within the system across which state cannot be maintained. FIG. 62A shows a communication channel between partitions of an SCG 6200 that can cause an action-only barrier. FIG. 62B shows a DCG 6202, corresponding to SCG 6200 after partitioning across the action-only barrier. In applying this to DCGs, all transient edges are left unchanged.

[0625] To perform the transformation across the action-only barrier, the following steps are performed. First, the nodes within SCG 6200 that will be placed on each side of the action-only barrier are identified (in other words, a graph cut is performed across the barrier, as in FIG. 62A). Second, each constraint edge that crosses the action-only barrier is replaced with an appropriate template. FIG. 63 depicts constraint edges 6300 and 6302 that cross the action-only barrier, from FIG. 62A, and their corresponding templates 6304 and 6306. Finally, an action node created in the previous step is filled in with an action sensitive to the activation (or deactivation) of the disjunctive node d opposite the incident edge.

[0626] B. Action/Constraint Conflicts

[0627] It is possible for an action to conflict with stable states during the execution of a software system. If the action has higher priority than a constraint on the stable state, then a glitch will occur whenever the situation is exercised. The glitch implies a system state that may, or may not, actually be entered. However, the effects of the glitch can be propagated through the rest of the software system, even though the system state implied by the glitch has not been entered. On the other hand, if the relevant constraint has a higher priority than the action, then the action cannot be performed in a given system configuration and may be a candidate for removal from the software system.

[0628] C. Action/Action Conflicts

[0629] An action can sometimes conflict with another action. Action/action conflicts are detected using similar checks to those given for stable-state consistency. These checks may occasionally leave two or more triples with overlapping modes and triggers but with contradictory actions. These contradictory actions can be resolved in a number of ways, some static and some dynamic. A static solution to the contradictory actions is to conservatively eliminate any possible conflicting portions from one of the triples. An example of a runtime solution is to allow the conflicting actions to propagate throughout the rest of the triples, resolving the conflict based on priority when it is time to lock down the new configuration.

[0630] 4. Model Checking

[0631] SCGs and DCGs can be used directly for a wide variety of checks involving what is allowed in particular configurations and what effects can result from changes on a small scale. However, for checking properties that may span several different system configurations in a sequence, other transformations can perform these checks more efficiently.

[0632] Model checking typically describes techniques in which finite state system models are checked against predicates in temporal logic. Control graph configurations, both SCGs and DCGs, represent systemwide state for a software system; it therefore follows that many abstract control graph properties can be verified using standard model checking. A binary decision diagram (BDD) can be a compact, though not always optimal, representation of transition relations of an extremely large state machine. BDDs frequently provide representations of systemwide state that are logarithmic relative to the size of the software system's statespace. SCGs and DCGs are also compact representations of large statespaces. However, there are many standard checks that can be easily performed on BDD representations that would be difficult to perform on an SCG or DCG representation of a software system.

[0633] To take advantage of the standard checks available for BDDs, a preferred embodiment of the current invention includes a system and method for converting SCGs and DCGs into BDDs, without incurring the penalty of fully elaborating the state space of the software system.

[0634] A. Temporal Unrolling

[0635] FIG. 64 depicts a current DCG 6400 along with a next DCG 6402, which is the result of temporally unrolling DCG 6400. With reference to FIG. 64, a temporal line 6404 separates current DCG 6400 from next DCG 6402. Temporal unrolling of current DCG 6400 allows a current control graph configuration, embodied in current DCG 6400, to be related to a set of next configurations, embodied in next DCG 6402, of the DCG for a particular software system. Unrolling current DCG 6400 across temporal line 6404 involves making a copy, to the right of temporal line 6404, of all disjunctive nodes, constraints, and all nondelayed actions within current DCG 6400. For clarity, a prime is added to the label of each copied element. For each delayed action within current DCG 6400, a sensing edge 6406 is connected to each appropriate nonprimed node 6408, and an enforcing edge 6410 is connected to each appropriate primed node 6412. When a sensing edge is an event edge, some additional nodes must be created. For an event sensing edge 6414, a new disjunctive node 6416, which represents the event itself, is created. A new conjunctive node 6418 ties new disjunctive node 6416 back to a creating node 6420 that created the event, constraining the state of creating node 6420 and its primed state appropriately (e.g., a +f event can only occur when f is already deactivated, but it leaves f′ activated, as shown).

[0636] This unrolled DCG has the characteristic function:

f_c = (a b c)(a′ b′ c′)(f +f)(+f f′)(e +f b′)(e +f g′).

[0637] This function now encodes the relation between the current configuration and the next configuration, which is annotated

$C \overset{c}{\rightarrow} C'$

[0638] for configurations C and C′.

[0639] FIG. 65A depicts a simple DCG 6500. FIG. 65B depicts an unrolled DCG 6502 for simple DCG 6500. With reference to FIGS. 65A and 65B, the characteristic function of unrolled DCG 6502 is as follows:

$f_c = (\neg a \vee c) \wedge (\neg c \vee \neg{+c}) \wedge (\neg{+c} \vee \neg a') \wedge (\neg a' \vee c') \wedge (\neg{+c} \vee c')$

[0640] and its transition relation is enumerated in Table 7.

TABLE 7
A transition relation for the characteristic function of unrolled DCG 6502.

        C               C′
  a     c     +c     a′    c′
  0     0     0      0     0
  0     0     0      0     1
  0     0     0      1     1
  0     0     1      0     1
  0     1     0      0     0
  0     1     0      0     1
  0     1     0      1     1
  1     1     0      0     0
  1     1     0      0     1
  1     1     0      1     1
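The rows of Table 7 can be reproduced by brute-force enumeration. The sketch below assumes the five constraints read as implications (a requires c; a +c event requires c to be deactivated, leaves a′ deactivated, and leaves c′ activated; a′ requires c′). That reading is an assumption chosen because it yields exactly the ten rows of Table 7; the function and variable names are illustrative only.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    return (not p) or q


def f_c(a: bool, c: bool, plus_c: bool, a1: bool, c1: bool) -> bool:
    """Assumed characteristic function over (a, c, +c, a', c')."""
    return (
        implies(a, c)                 # a requires c
        and implies(plus_c, not c)    # +c only occurs while c is deactivated
        and implies(plus_c, not a1)   # +c leaves a' deactivated
        and implies(a1, c1)           # a' requires c'
        and implies(plus_c, c1)       # +c leaves c' activated
    )


# Enumerate all 32 assignments and keep those that satisfy every constraint.
rows = [bits for bits in product((0, 1), repeat=5) if f_c(*map(bool, bits))]

for row in rows:
    print(*row)
```

Explicit enumeration like this is exponential in the number of variables, which is the cost the BDD construction described later is meant to avoid.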

[0641] B. Computation Tree Logic

[0642] Application of model checking requires a predicate logic that is powerful enough to express the temporal state relationships that will be checked. Computation tree logic (CTL) is a superset of the temporal logic introduced above, and it includes two new operators: X (for neXt) and U (for Until). AXΦ is true in the present if Φ is true in all possible next states. EXΦ is true in the present if Φ is true in at least one of the possible next states. A(Φ₀ U Φ₁) is true in the present if, on all outgoing paths, Φ₀ is true until Φ₁ becomes true.

[0643] The lambda calculus provides a flexible form for representing temporal logic expressions. Lambda expressions (i.e., expressions taking the form λx.E) are often used to represent functionals, or functions that operate on other functions. The expression λx.E p represents the expression E with p replacing each occurrence of x. For example, λx.(x+y) 3 is equivalent to (3+y), and λx.(x+y) 3y+z is equivalent to (3y+z+y) or (4y+z).

[0644] To make use of this with CTL, first consider that

$S \vDash EX\Phi$

[0645] is equivalent to

$\exists y\,\left(S \overset{c}{\rightarrow} y \wedge y \vDash \Phi\right)$;

[0646] the notation

$S_a \vDash \Phi$

[0647] indicates that Φ is true in state S_a.

[0648] Therefore, a lambda expression representing all configurations for which predicate Φ is true on the next step on some path would be:

[0649] $EX\Phi = \lambda x.\exists C.\left(\left(x \overset{c}{\rightarrow} C\right) \wedge \left(C \vDash \Phi\right)\right)$

[0650] This defines EXΦ as a functional that can be applied to a configuration Z and that evaluates to true if and only if

$Z \vDash EX\Phi$.

[0651] Using this, along with boolean functions representing the transition relation and all configurations for which Φ is true, we can derive an additional boolean function that represents all configurations for which EXΦ is true.

[0652] Consider again the graph in FIG. 65. The characteristic function of unrolled DCG 6502 is

$f_c = (\neg a \vee c) \wedge (\neg a' \vee c') \wedge (\neg c \vee \neg a')$.

[0653] To find a function that represents all configurations such that EXΦ, where Φ equals $a \wedge c$, we have:

$\begin{aligned}EX(a \wedge c) &= \exists C'.\left(f_c \wedge C' \vDash (a \wedge c)\right) \\ &= \exists C'.\left(f_c \wedge (a' \wedge c')\right) \\ &= \exists C'.\left((\neg a \vee c) \wedge (\neg a' \vee c') \wedge (\neg c \vee \neg a') \wedge a' \wedge c'\right) \\ &= \exists C'.\left(\neg a \wedge \neg c \wedge a' \wedge c'\right) \\ &= \neg a \wedge \neg c\end{aligned}$

[0654] And so, using the lambda calculus version of

$EX(a \wedge c)$

[0655] and some boolean manipulation, we have obtained an expression

$(\neg a \wedge \neg c)$

[0656] that is equivalent to the expression

$EX(a \wedge c)$

[0657] in terms of this particular control graph.
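The same result can be checked by explicit enumeration instead of boolean manipulation. This is a minimal sketch, assuming the three-constraint characteristic function reads as (¬a ∨ c) ∧ (¬a′ ∨ c′) ∧ (¬c ∨ ¬a′); the helper name `ex` is hypothetical.

```python
from itertools import product

BOOLS = (False, True)


def f_c(a: bool, c: bool, a1: bool, c1: bool) -> bool:
    """Assumed transition relation between (a, c) and (a', c')."""
    return ((not a) or c) and ((not a1) or c1) and ((not c) or (not a1))


def ex(phi):
    """Current configurations (a, c) with at least one legal successor satisfying phi."""
    return {
        (a, c)
        for a, c in product(BOOLS, repeat=2)
        if any(f_c(a, c, a1, c1) and phi(a1, c1) for a1, c1 in product(BOOLS, repeat=2))
    }


result = ex(lambda a, c: a and c)
print(result)  # {(False, False)}
```

The only configuration returned has both a and c deactivated, which agrees with the expression obtained by the derivation.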

[0658] A fixed point of a lambda expression (L) is an expression (F) such that LF=F. For example, if

$L = \lambda x.(x \vee y)$ and $F = \mathrm{true}$, then $LF = \mathrm{true} \vee y = \mathrm{true}$,

[0659] and so true is a fixed point of L. The least fixed point of lambda expression λx.E_x is notated as μx.E_x, and the greatest fixed point is notated νx.E_x. Using these, we can describe two more functional representations:

$EF\Phi = \mu Y.(\Phi \vee EXY)$

$EG\Phi = \nu Y.(\Phi \wedge EXY)$
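Both fixed points can be computed by simple iteration when the state space is small enough to list. The sketch below is an explicit-state illustration, not the symbolic procedure of the disclosure: states are integers, `succ` is a hypothetical successor map, and predicates are sets of states.

```python
from typing import Dict, Set

Succ = Dict[int, Set[int]]


def ex(succ: Succ, target: Set[int]) -> Set[int]:
    """States with at least one successor in target."""
    return {s for s, nxt in succ.items() if nxt & target}


def ef(succ: Succ, phi: Set[int]) -> Set[int]:
    """Least fixed point of Y = phi | EX(Y), iterated upward from the empty set."""
    y: Set[int] = set()
    while True:
        new = phi | ex(succ, y)
        if new == y:
            return y
        y = new


def eg(succ: Succ, phi: Set[int]) -> Set[int]:
    """Greatest fixed point of Y = phi & EX(Y), iterated downward from all states."""
    y = set(succ)
    while True:
        new = phi & ex(succ, y)
        if new == y:
            return y
        y = new


# Hypothetical four-state system: 0 -> 1 -> 2 -> 2 and 3 -> 3.
succ = {0: {1}, 1: {2}, 2: {2}, 3: {3}}
print(ef(succ, {2}))        # {0, 1, 2}
print(eg(succ, {0, 1, 3}))  # {3}
```

State 3 is the only state in the second result because it is the only one with a path that stays inside {0, 1, 3} forever.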

[0660] C. Binary Decision Diagrams (BDDs)

[0661] The weakest link of symbolic model checking is the fact that boolean manipulation is NP-hard and can require large amounts of space in a computer's memory. However, it has been shown that for a large class of problems, binary decision diagrams can provide efficient representations for state machine transition relations that are easy to manipulate and combine. A BDD is a reduced representation of a truth table for a boolean function. Although any boolean function can be represented as a truth table, the truth table is exponential in size relative to the number of variables in the boolean function. A truth table can be represented as a tree in which each boolean variable corresponds to nodes and each assignment of values corresponds to a directed edge. FIG. 66A shows a truth table 6600 for a boolean “and” function 6602. FIG. 66B shows a truth tree 6604 that corresponds to truth table 6600.

[0662] Finding the truth value of an assignment is performed as follows. Starting from the root, edges are traversed until a leaf node is reached. The particular edge traversed from each node is labeled with the value assigned to the corresponding boolean variable. The leaf node reached contains the truth value of the assignment.

[0663] FIGS. 67A, B, and C show several reduced BDDs 6700, 6702, and 6704 for truth tree 6604, shown in FIG. 66B. The procedure for looking up truth values is to start at the root and traverse the edges labeled with the value assigned to the corresponding variable.

[0664] Whereas the number of nodes in a truth tree grows exponentially with the number of variables in a corresponding function, a BDD that grows only polynomially can often be found. Reduction of a BDD is performed with the aid of a cache, where reduced subgraphs can be stored. This cache can be a hash table. In the reduce algorithm, as disclosed in BRYANT, R. E., “Symbolic Boolean Manipulation with Ordered Binary-Decision Diagrams,” ACM Computing Surveys 24, 3 (September 1992), 293-318, assume that each BDD node is a tuple (n, l, r), where n is the name of the corresponding variable, l is the BDD connected by the left edge, and r is the BDD connected by the right edge. Furthermore, a BDD cache has two operations: (1) put(BDD), for placing BDDs in the cache, and (2) lookup(name, BDD, BDD), where the BDD parameters are the l and r subgraphs of the cached BDD, for finding BDDs already cached and returning null if none is found.

[0665] Thus a BDD can be created for any boolean function by enumerating its truth table in the form of a truth tree and calling the “reduce” algorithm. However, this approach still suffers from exponential growth, since it requires exponential space for the truth tree until the BDD is constructed. Instead, BDDs can be efficiently and methodically grown using a procedure embodied in the “apply” algorithm as disclosed in BRYANT, Id. Apply generates a BDD that represents an arbitrary boolean operation applied to two BDDs, i.e., B_r such that B_r = B₁ op B₂.

[0666] The apply algorithm is derived from Shannon expansions of the functions represented by the input BDDs. A Shannon expansion represents a boolean function as an expression containing partial evaluations. For example, expanding f(x, y, z) in terms of x:

$f(x, y, z) = (x \wedge f(1, y, z)) \vee (\neg x \wedge f(0, y, z))$

[0667] which can, of course, be further expanded in terms of y and z. It is typically notated

$f = (\neg x \wedge f|_{x \leftarrow 0}) \vee (x \wedge f|_{x \leftarrow 1})$.

[0668] This is also known as the cofactor expansion of a function. A Shannon expansion of f op g is:

$f \;\mathrm{op}\; g = [\neg x \wedge (f|_{x \leftarrow 0} \;\mathrm{op}\; g|_{x \leftarrow 0})] \vee [x \wedge (f|_{x \leftarrow 1} \;\mathrm{op}\; g|_{x \leftarrow 1})]$

Algorithm 3 BDD reduce

BDD reduce(T)
  Require: T is a BDD
  C ← new BDD cache
  C.put(new BDD(1, null, null)) // cache leaf nodes
  C.put(new BDD(0, null, null))
  return reduceC(T, C)
end

BDD reduceC(T, C)
  Require: T is a BDD, and C is a BDD cache
  if T = null then
    return null
  else
    l ← reduceC(T.l, C)
    r ← reduceC(T.r, C)
  end if
  return C.lookup(T.n, l, r)
end

Algorithm 4 BDD apply

BDD apply(op, B₁, B₂)
  Require: op is an operator; B₁ and B₂ are BDDs
  C ← new BDD cache
  C.put(new BDD(1, null, null)) // cache leaf nodes
  C.put(new BDD(0, null, null))
  return applyC(op, B₁, B₂, C)
end

BDD applyC(op, B₁, B₂, C)
  if depth(B₁.n) = depth(B₂.n) then // Same variable
    n ← B₁.n
    l ← applyC(op, B₁.l, B₂.l, C)
    r ← applyC(op, B₁.r, B₂.r, C)
  else if depth(B₁.n) < depth(B₂.n) then // B₁ precedes B₂
    n ← B₁.n
    l ← applyC(op, B₁.l, B₂, C)
    r ← applyC(op, B₁.r, B₂, C)
  else // B₂ precedes B₁
    n ← B₂.n
    l ← applyC(op, B₁, B₂.l, C)
    r ← applyC(op, B₁, B₂.r, C)
  end if
  return C.lookup(n, l, r)
end
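A runnable sketch of both algorithms may help make the recursion concrete. It is an illustration under several assumptions that are not in the listings above: the cache's put and lookup are merged into one hypothetical `mk` call that creates a node when none is cached, a node whose two edges reach the same subgraph is dropped, the left edge is taken as the 0 branch, and two leaves are combined by evaluating the operator directly.

```python
from itertools import product
from operator import and_


class BDD:
    """Node (n, l, r): n is a variable name, or 0/1 at a leaf (l = r = None)."""

    def __init__(self, n, l=None, r=None):
        self.n, self.l, self.r = n, l, r


class Cache:
    """Unique table; mk() stands in for the put/lookup pair (an assumption)."""

    def __init__(self):
        self.table = {}
        self.zero = self.mk(0, None, None)
        self.one = self.mk(1, None, None)

    def mk(self, n, l, r):
        if l is not None and l is r:  # both edges agree, so the test is redundant
            return l
        key = (n, id(l), id(r))
        if key not in self.table:
            self.table[key] = BDD(n, l, r)
        return self.table[key]


def reduce(t, cache):
    """Bottom-up reduction of a truth tree (or any unreduced BDD)."""
    if t.l is None:
        return cache.mk(t.n, None, None)
    return cache.mk(t.n, reduce(t.l, cache), reduce(t.r, cache))


def apply(op, b1, b2, cache, depth, memo=None):
    """BDD for (b1 op b2); depth maps each variable name to its position in the order."""
    memo = {} if memo is None else memo
    key = (id(b1), id(b2))
    if key in memo:
        return memo[key]
    if b1.l is None and b2.l is None:  # two leaves: evaluate op directly
        res = cache.one if op(b1.n, b2.n) else cache.zero
    else:
        d1 = depth[b1.n] if b1.l is not None else len(depth)
        d2 = depth[b2.n] if b2.l is not None else len(depth)
        n = b1.n if d1 <= d2 else b2.n
        l1, r1 = (b1.l, b1.r) if d1 <= d2 else (b1, b1)
        l2, r2 = (b2.l, b2.r) if d2 <= d1 else (b2, b2)
        res = cache.mk(n, apply(op, l1, l2, cache, depth, memo),
                       apply(op, r1, r2, cache, depth, memo))
    memo[key] = res
    return res


def evaluate(b, env):
    """Follow the edge selected by each variable's value until a leaf is reached."""
    while b.l is not None:
        b = b.r if env[b.n] else b.l
    return b.n


cache = Cache()
depth = {"x": 0, "y": 1}
x = cache.mk("x", cache.zero, cache.one)
y = cache.mk("y", cache.zero, cache.one)
conj = apply(and_, x, y, cache, depth)

# Truth tree for "and", built with fresh, unshared leaves, then reduced.
tree = BDD("x", BDD("y", BDD(0), BDD(0)), BDD("y", BDD(0), BDD(1)))
reduced = reduce(tree, cache)
print(reduced is conj)  # True
```

Because both routes go through the same unique table, reducing the truth tree and growing the BDD with `apply` return the very same node object, which is the canonicity property that makes BDD equality checks cheap.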

[0669] D. BDD Representations of Control Graphs

[0670] A BDD can often be used to represent transition relations with exponential state-spaces while using only polynomial storage space. A BDD can be constructed from a control graph by unrolling the control graph, as described above, and then using the apply algorithm, described above, to build a BDD from a characteristic function for the unrolled control graph. To efficiently represent the state-space, attention must be paid to variable ordering within the characteristic function. Good orderings have contiguous sequences of highly correlated variables.

[0671] Consider DCG 6102, which represents the unrolled three-client rendezvous/preempt coordinator, shown in FIG. 61. This example is interesting because the state-space represented grows exponentially with the number of participants. BDD representations would be of little value for our purposes if they experienced exponential growth in creation or if they required exponential storage. As shown below, BDDs provide compact representations for such examples.

[0672] FIG. 68 shows the results of using the apply algorithm to grow a BDD 6800, which represents the characteristic function of unrolled DCG 6502 from FIG. 65.

[0673] An unrolled DCG contains a great deal of information that can aid a designer in finding an efficient variable ordering. Constraints and pseudo-constraints (i.e., those introduced by unrolling) connect variables that are likely to have the least change with respect to each other. For example, given the following constraints:

$\{\ldots,\; C \Rightarrow G,\; C \Rightarrow B,\; B \Rightarrow C,\; \ldots\}$

[0674] if $C \Rightarrow G$ is uncontested, we know that C and G are fairly strong candidates to be located next to each other in the variable ordering. However, $C \Rightarrow B$ and $B \Rightarrow C$ suggest that B and C are even stronger candidates.

[0675] FIG. 69A shows an unrolled rendezvous DCG 6900. With reference to FIG. 69A, unrolled rendezvous DCG 6900 is created by temporally unrolling DCG 6102, which represents the rendezvous coordinator, as described with reference to FIG. 61. The characteristic functions for unrolled rendezvous DCG 6900 are as follows:

f_a = wait_a (wait_b wait_c) avail wait′_a, avail′

f_b = wait_b wait_c avail wait′_b, avail′

f_c = wait_c avail wait′_c, avail′

[0676] The DCG shows that each wait_x could be a strong candidate for colocation with its respective wait′_x, with wait_a, wait′_a, wait_b, wait′_b, wait_c, wait′_c, avail, avail′ as a reasonable order. This variable ordering, which can be referred to as a temporal cluster, yields a BDD that has fifteen nodes.

[0677] For further improvement, notice that each of the primed wait nodes depends upon unprimed versions of wait nodes adjacent to and below it in the graph (e.g., wait′_a depends upon wait_a, wait_b, and wait_c). This suggests that wait_a, wait_b, and wait_c should all precede wait′_a in a BDD, wait_b and wait_c should precede wait′_b, and so forth. This, combined with the order constraint suggested in the last paragraph, indicates that wait_c, wait′_c, wait_b, wait′_b, wait_a, wait′_a, avail, avail′ should be a very good ordering. In fact, it yields a BDD with eleven nodes. This variable ordering will be referred to as cluster/depend. While cluster/depend is only slightly better than the order given in the previous example, it offers an advantage when the number of rendezvous wait participants increases from three. As shown in Table 8, the simple temporal cluster order creates BDDs that consistently have around four times as many nodes as there are participants. The cluster/depend order creates BDDs whose node count is only around twice the number of participants. Both, however, are linear in size based on the number of participants. This is much better than the boolean expression form of the characteristic function, which grows quadratically with the number of participants.

TABLE 8
Growth rate for BDDs representing temporally unfolded rendezvous.

  wait participants    temporal cluster    cluster/depend
          3                   15                 11
          8                   35                 21
         15                   63                 35
         25                  103                 55
         50                  203                105
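The linear growth can be checked directly against the table. As an observation about the tabulated values only (the disclosure states no formulas), the temporal cluster column fits 4n + 3 and the cluster/depend column fits 2n + 5 for n participants; the short check below confirms this for every row.

```python
# Table 8 values: participants -> (temporal cluster nodes, cluster/depend nodes).
table8 = {3: (15, 11), 8: (35, 21), 15: (63, 35), 25: (103, 55), 50: (203, 105)}

# Linear fits observed in the data; they are not formulas given in the text.
fits = {
    n: (cluster == 4 * n + 3, depend == 2 * n + 5)
    for n, (cluster, depend) in table8.items()
}
print(all(a and b for a, b in fits.values()))  # True
```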

[0678] FIG. 69B shows that the critical factor in ordering for this DCG is really just the order of wait_c, wait_b, and wait_a with respect to each other. The ordering shown in FIG. 69B is equivalent in quality to cluster/depend.

[0679] With these compact and canonical symbolic representations of rendezvous and other coordinators' exponential statespaces, we can apply model checking techniques to perform preemptive debugging and catch many bugs before implementation software is synthesized, compiled, and run.

[0680] E. Application of Model Checking

[0681] Model checking is performed using McMillan's AndExists algorithm, disclosed in McMILLAN, K. L., Symbolic Model Checking: An Approach to the State Explosion Problem, Ph.D. thesis, Carnegie Mellon University, 1992. AndExists evaluates

$\lambda x.\exists V.(p \wedge q)$

[0682] where p and q are boolean expressions represented as BDDs and V is a boolean assignment vector

$(\nu_i \in \{\mathrm{false}, \mathrm{true}\})$.

[0683] In conjunction with

$EX\Phi = \lambda x.\exists C.\left(\left(x \overset{c}{\rightarrow} C\right) \wedge \left(C \vDash \Phi\right)\right)$

[0684] the above allows EXΦ to be computed.

[0685] Many system properties can be verified by checking multiple components simultaneously. Some examples are checks for deadlock

$(\exists S : S \vDash AG\,\mathrm{false})$

[0686] and livelock

$(\exists S : S \vDash AF\,S)$.

[0687] An important check is determining whether a software system always converges on a consistent state. When a DCG is partitioned among several subsystems (e.g., on a multiprocessor architecture), an action-only barrier is formed between the portions on each subsystem. Frequently, the actions that cross have delays that can span several scheduling steps, and these delays are often functions of bus traffic or other factors. To make such a system match the semantics described here, it is necessary to synchronize all subsystems so that action delays are never more than a single scheduling step. However, this is too conservative for an execution model, and it eliminates some of the advantage of multiprocessor architectures. It often makes more sense to let components interact asynchronously when possible and to ensure that the system will always resolve to a consistent state. To do so, it is necessary to choose an execution model that represents asynchrony.

[0688] Interleaving asynchronous models assume that processes can change state at any time, but only one at a time. However, this assumption may be too conservative, since several correlations can be made between state-bits on a single component. It is more accurate to say that any component can perform any locally legal control state change at any time. As shown in FIGS. 70A, B, C, and D, this means that one moment after the configuration shown in FIG. 70A is valid, either of the configurations shown in FIG. 70B or FIG. 70C can be valid. The configuration shown in FIG. 70D cannot be valid at that moment, because it reflects two simultaneous state changes. Since these changes are concurrent, however, the configuration in FIG. 70D may result two time steps after the configuration in FIG. 70A.
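The one-change-per-step rule can be sketched as a successor function. This is a minimal illustration, assuming a configuration is a tuple of per-component control states and that `legal_moves` is a hypothetical table of each component's locally legal state changes; the state names are invented for the example.

```python
from typing import Dict, List, Sequence, Tuple

Config = Tuple[str, ...]


def successors(config: Config, legal_moves: Sequence[Dict[str, List[str]]]) -> List[Config]:
    """Configurations reachable in one step: exactly one component changes state."""
    result = []
    for i, state in enumerate(config):
        for nxt in legal_moves[i].get(state, ()):
            result.append(config[:i] + (nxt,) + config[i + 1:])
    return result


# Two hypothetical components, each able to move from "idle" to "busy".
moves = [{"idle": ["busy"]}, {"idle": ["busy"]}]
start = ("idle", "idle")

one_step = successors(start, moves)
two_steps = {c2 for c1 in one_step for c2 in successors(c1, moves)}

print(one_step)   # [('busy', 'idle'), ('idle', 'busy')]
print(two_steps)  # {('busy', 'busy')}
```

Here `start` plays the role of FIG. 70A, the two entries of `one_step` play the roles of FIGS. 70B and 70C, and the doubly changed configuration of FIG. 70D appears only in `two_steps`.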

[0689] F. Look-Ahead Predicates

[0690] Using model checking to perform inquiries into a particular property in a DCG produces a boolean function that identifies all configurations for which it is true.

[0691] In debugging a system, designers want to track whether configurations in a particular execution can possibly lead to a configuration in which a particular predicate holds and which configurations would do so. The expressions derived from the application of model checking are look-ahead predicates.

[0692] Control/Dataflow Graphs (CDGs)

[0693] FIG. 71 shows a control/dataflow graph (CDG) 7100. CDG 7100 represents dataflow-based transparent actions 7102 and 7104 as part of its structure. CDG 7100 has the same overall structure as a DCG but further allows both data ports 7106 and 7108 and dataflow nodes 7110 and 7112. Dataflow actions are allowed, within this structure, to cause control changes.

[0694] A CDG is a triple, $G = (\Gamma_f, \Delta_f, E_f)$, in which:

[0695] $\Gamma_f$ is a set of conjunctive and action nodes, i.e., $\Gamma_f \subseteq (C \cup A)$, in which C is the universe of all possible conjunctive nodes and A is the universe of all possible actions.

[0696] $\Delta_f$ is a set of disjunctive and dataflow nodes, i.e., $\Delta_f \subseteq (D \cup F)$, in which D is the universe of all possible disjunctive nodes and F is the universe of all possible dataflow nodes.

[0697] $E_f \subseteq [\{T_f, T_t\} \times \{H_f, H_t\} \times \{T, N\} \times ((\Gamma_f \times \Delta_f) \cup (\Delta_f \times \Gamma_f))]$ is a set of directed edges that are labeled either transient (T) or continuous (N).
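The triple translates naturally into a small data structure. The sketch below is one possible encoding, not the disclosure's implementation; it assumes that the T_f/T_t label gives the truth value an edge responds to at its origin and that H_f/H_t gives the value it asserts at its destination, which is an interpretation and is not stated in this definition. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass(frozen=True)
class Edge:
    tail_true: bool   # T_t if True, T_f if False (assumed: value sensed at the origin)
    head_true: bool   # H_t if True, H_f if False (assumed: value asserted at the destination)
    transient: bool   # T if True, N (continuous) if False
    origin: str
    dest: str


@dataclass
class CDG:
    gamma: Set[str]                               # conjunctive and action nodes
    delta: Set[str]                               # disjunctive and dataflow nodes
    edges: Set[Edge] = field(default_factory=set)

    def add_edge(self, edge: Edge) -> None:
        """Accept only edges in (gamma x delta) or (delta x gamma)."""
        forward = edge.origin in self.gamma and edge.dest in self.delta
        backward = edge.origin in self.delta and edge.dest in self.gamma
        if not (forward or backward):
            raise ValueError(f"edge {edge.origin}->{edge.dest} must join gamma and delta")
        self.edges.add(edge)


g = CDG(gamma={"c1", "act1"}, delta={"d1", "flow1"})
g.add_edge(Edge(True, True, False, "d1", "c1"))
print(len(g.edges))  # 1
```

The check in `add_edge` enforces the bipartite shape of the edge set: an edge may never join two nodes drawn from the same one of the two node sets.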

[0698] A. Transaction/Constraint Conflicts

[0699] Transactions are part of the definition of coordinators. However, they are not usually executed directly by the coordinator. Transactions may be initiated by components, and responses must often be generated by components rather than by the coordinator itself. Coordinators contain constraints that can play a role in enforcing these semantics, and it is important to ensure that a component's constraints are consistent with the transactions specified for the coordinator.

[0700] 1. Dedicated RPC

[0701] FIG. 72 shows a CDG 7200 representation of an RPC system. CDG 7200 does not consider any coupling between control and data. There are two aspects to control. The first aspect of control is steady-state control, in which there is global control of transitions from one state to another. The second aspect of control is the interaction between control and data. Control typically must consider how the software system deals with the transfer of data, in the form of parameters, for the recipient and return values.

[0702] B. Partitioning CDGs Across Message-Only Barriers

[0703] Just as SCGs and DCGs can be partitioned across action-only barriers through template-based graph transformations, both they and CDGs can be partitioned across barriers that permit only message traffic. The first stage in this transformation is as described above with respect to SCGs and DCGs. FIG. 73A shows the second step in partitioning a CDG 7300 across an action-only barrier 7302. With reference to FIG. 73A, the second step is to transform each action on the barrier into supplementary actions and messages. FIG. 73B shows the graph of DCG 6202, from FIG. 62B, with the action-only barrier transformed to a message-only barrier 7304.

[0704] C. Dataflow Consistency and SDF Extraction

[0705] Dataflow consistency means that when the system is viewed as a whole, production rates are compatible with their related consumption rates. A system that is consistent in terms of dataflow will have no configuration in which production exceeds consumption on any path. Failure to detect such inconsistencies in advance could result in difficult-to-track memory leak bugs and bugs wherein data is dropped or overwritten before it is processed (depending on queue-management policies).

[0706] Synchronous dataflow (SDF) extraction for these dataflow graphs (DAG) is performed in two parts. First, a constant-rate sub-DAG is fenced off through a pair of cuts. A constant-rate sub-DAG is a directed acyclic dataflow graph in which each port has constant token rates. Given a constant-rate sub-DAG, each edge is a linear relation between two dataflow actions. Second, by solving a consistent series of linear equations, schedule components can be found for each node. FIG. 74A shows a CDG 7400 with a set of message rate guarantees 7402. FIG. 74B shows a dataflow graph 7404 based on CDG 7400. With reference to FIGS. 74A and B, the start of each edge 7406 is labeled with a production rate 7408 and the sink is labeled with a consumption rate 7410. An inconsistent series of equations yields no solution, and it also represents inconsistent dataflow. Hence, one algorithm can be used for both dataflow consistency and SDF extraction.

[0707] 1. Finding Normalized Schedule Coefficients

[0708] Schedule coefficients indicate the number of times a particular component is to be executed sequentially to consume all tokens generated by a component earlier in the dataflow graph. The problem is now to find a set of minimum practical scheduling coefficients. For dataflow actions a, b, c, and d (shown in FIG. 74B) the minimum scheduling coefficients are notated A₀, B₀, C₀, and D₀. Practical schedule coefficients must be natural numbers because it is typically not possible to execute a component a fractional number of times. However, in obtaining these we need an intermediate step using rational relative schedule coefficients. Relative schedule coefficients are the minimum coefficients divided by a particular component's coefficient. The schedule coefficient for component a normalized to component d's coefficient is notated A_d. The value of D_d is defined to be 1.

[0709] Finding relative schedule coefficients is accomplished by deriving a set of relative scheduling constraints between pairs of nodes. Each internal edge in a constant-rate cluster must have constant token rates on each side. The relative rates for two nodes, x and y, connected by an edge with token rates of m on the x side and n on the y side is nx=my, which means that whenever x fires n times, y must fire m times. We can then relate their scheduling coefficients as mX=nY. This relationship holds for all valid practical and relative scheduling coefficients, not just the minimum coefficients.

[0710] Normalized scheduling rates can be expressed in terms of a single component reference. To do so, each set of relative rates is first expressed as a linear equation. For example, let A_z be the normalized schedule coefficient for component a and B_z be the normalized schedule coefficient for component b, where a and b are connected by an edge with token rates of m on the a side and n on the b side. We express their relative coefficient relationship as $mA_z - nB_z = 0$. And so, for the relative ratio matrix $R$, we are solving for the schedule coefficient vector in terms of z, or

$\Gamma_z$, with $R \cdot \Gamma_z = 1_z$,

[0711] where $1_z$ is a vector that is zero in all except the z position, where it is one.

[0712] For dataflow graph 7404, using d as the reference component, this results in the following:

$R = \begin{pmatrix}3 & 0 & -2 & 0 \\ 0 & 4 & -3 & 0 \\ 0 & 0 & 5 & -2 \\ 0 & 0 & 0 & 1\end{pmatrix} \quad \text{and} \quad 1_d = \begin{pmatrix}0 \\ 0 \\ 0 \\ 1\end{pmatrix}$

[0713] Solving $R \cdot \Gamma_d = 1_d$ for $\Gamma_d$, we have:

$\Gamma_d = \begin{pmatrix}A_d \\ B_d \\ C_d \\ D_d\end{pmatrix} = \begin{pmatrix}4/15 \\ 3/10 \\ 2/5 \\ 1\end{pmatrix}$

[0714] Notice that the top three rows of R correspond to the three edges in dataflow graph 7404.
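Because R is upper triangular for this graph, the coefficients can be recovered by plain back substitution with exact rational arithmetic. The following is a minimal sketch; the function name `back_substitute` is hypothetical, and the approach relies on the triangular shape of this particular matrix rather than being a general solver.

```python
from fractions import Fraction

# Relative ratio matrix and reference vector for dataflow graph 7404 (reference d).
R = [
    [3, 0, -2, 0],   # 3A - 2C = 0
    [0, 4, -3, 0],   # 4B - 3C = 0
    [0, 0, 5, -2],   # 5C - 2D = 0
    [0, 0, 0, 1],    # D = 1
]
one_d = [0, 0, 0, 1]


def back_substitute(matrix, rhs):
    """Solve an upper-triangular system exactly, working from the last row upward."""
    n = len(matrix)
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        known = sum(Fraction(matrix[i][j]) * x[j] for j in range(i + 1, n))
        x[i] = (Fraction(rhs[i]) - known) / matrix[i][i]
    return x


gamma_d = back_substitute(R, one_d)
print([str(v) for v in gamma_d])  # ['4/15', '3/10', '2/5', '1']
```

Using `Fraction` keeps every intermediate value exact, so the normalized coefficients come out as the same rationals given above rather than as floating-point approximations.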

[0715] Notice that $\Gamma_d$ has a solution if and only if the dataflow graph is consistent. Graph inconsistency is only possible if there are at least as many edges as actions, i.e., if the system of equations is overdefined. If Gaussian elimination is used for solving the matrix system, then the goal with an overdefined system is to cancel out one row (typically the bottom), making that row all zeros. If the dataflow graph is inconsistent, then the resulting matrix will have a row of the form [0 0 0 0 . . . 0 | x], where x≠0.

[0716] FIG. 75 shows a dataflow graph 7500. With reference to dataflow graph 7500, the starting Gaussian matrix is as follows:

$\left(\begin{array}{cccc|c}5 & -2 & 0 & 0 & 3 \\ 1 & 0 & -2 & 0 & 15/2 \\ 0 & 2 & 0 & -4 & 3/2 \\ 0 & 0 & 2 & -3 & 15/4 \\ 0 & 0 & 0 & 1 & -11/4\end{array}\right)$

[0717] After elimination and attempting to zero out the bottom row, the result is as follows:

$\left(\begin{array}{cccc|c}1 & 0 & 0 & 0 & 3 \\ 0 & 1 & 0 & 0 & 15/2 \\ 0 & 0 & 1 & 0 & 3/2 \\ 0 & 0 & 0 & 1 & 15/4 \\ 0 & 0 & 0 & 0 & -11/4\end{array}\right)$

[0718] The result shown above indicates that the system represented by dataflow graph 7500 cannot be properly scheduled. Furthermore, even without exact production and consumption rates, a similar procedure can be used to catch inconsistent dataflow.
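The consistency test itself can be sketched as Gauss-Jordan elimination over exact rationals. The example below reuses the left-hand coefficients of the matrix for dataflow graph 7500 but, as an assumption, pairs them with a right-hand side of zero for each edge row and one for the reference row, following the form R·Γ_d = 1_d; it therefore illustrates the detection of a contradictory row and is not meant to reproduce the printed right-hand column.

```python
from fractions import Fraction


def is_consistent(augmented):
    """Gauss-Jordan elimination; False if a row [0 ... 0 | x] with x != 0 appears."""
    m = [[Fraction(v) for v in row] for row in augmented]
    ncols = len(m[0]) - 1
    pivot_row = 0
    for col in range(ncols):
        # Find a row at or below pivot_row with a nonzero entry in this column.
        found = next((r for r in range(pivot_row, len(m)) if m[r][col] != 0), None)
        if found is None:
            continue
        m[pivot_row], m[found] = m[found], m[pivot_row]
        pivot = m[pivot_row][col]
        m[pivot_row] = [v / pivot for v in m[pivot_row]]
        for r in range(len(m)):
            if r != pivot_row and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[pivot_row])]
        pivot_row += 1
    return not any(all(v == 0 for v in row[:-1]) and row[-1] != 0 for row in m)


# Five constraints over four actions (assumed right-hand side: 0 for edges, 1 for reference).
graph_7500 = [
    [5, -2, 0, 0, 0],
    [1, 0, -2, 0, 0],
    [0, 2, 0, -4, 0],
    [0, 0, 2, -3, 0],
    [0, 0, 0, 1, 1],
]
graph_7404 = [
    [3, 0, -2, 0, 0],
    [0, 4, -3, 0, 0],
    [0, 0, 5, -2, 0],
    [0, 0, 0, 1, 1],
]
print(is_consistent(graph_7500))  # False
print(is_consistent(graph_7404))  # True
```

The overdefined system fails because its five equations cannot all hold at once, while the four-equation system for dataflow graph 7404 reduces cleanly.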

[0719] 2. Practical Schedule Coefficients

[0720] Executing a system requires all scheduling coefficients to be integers. This can be facilitated by finding a lowest common denominator (LCD) of a set of normalized scheduling coefficients and multiplying the LCD through. For FIG. 74B, the LCD is 30, and so:

$\Gamma_0 = \begin{pmatrix}A_0 \\ B_0 \\ C_0 \\ D_0\end{pmatrix} = \begin{pmatrix}8 \\ 9 \\ 12 \\ 30\end{pmatrix}$
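The scaling step is a one-liner once the denominators are available. This short sketch assumes Python 3.9 or later for `math.lcm` and starts from the normalized coefficients 4/15, 3/10, 2/5, and 1.

```python
from fractions import Fraction
from math import lcm

# Normalized schedule coefficients (A_d, B_d, C_d, D_d) for FIG. 74B.
gamma_d = [Fraction(4, 15), Fraction(3, 10), Fraction(2, 5), Fraction(1)]

# The lowest common denominator is the least common multiple of the denominators.
lcd = lcm(*(c.denominator for c in gamma_d))
gamma_0 = [int(c * lcd) for c in gamma_d]

print(lcd)      # 30
print(gamma_0)  # [8, 9, 12, 30]
```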

[0721] 3. Schedule Ordering

[0722] The order of components in the schedule must be consistent with causality in the software system being modeled. A proper ordering can be determined by a topological sort of the dataflow graph (e.g., b, a, c, d for the example in FIG. 74B). By combining scheduling coefficients, represented by vector Γ₀, with ordering, a complete consistent schedule can be found. For dataflow graph 7404, the complete consistent schedule is as follows: 9b·8a·12c·30d. This means execute action b 9 times while buffering results, then execute a 8 times while buffering results, then execute c 12 times while buffering results, and finally execute d 30 times.
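Ordering and coefficients can be combined mechanically. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) and assumes the edges of dataflow graph 7404 run a→c, b→c, and c→d, an inference from the rows of R and from the ordering given above. A topological order is not unique, so the result may list a before b; either is consistent with causality.

```python
from graphlib import TopologicalSorter

# Map each action to the set of actions that must run before it (assumed edges).
predecessors = {"c": {"a", "b"}, "d": {"c"}}
coefficients = {"a": 8, "b": 9, "c": 12, "d": 30}

order = list(TopologicalSorter(predecessors).static_order())
schedule = "·".join(f"{coefficients[n]}{n}" for n in order)

print(order)
print(schedule)
```

Whichever of a or b comes first, c follows both and d comes last, so the buffered tokens each action needs are always available when it runs.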

[0723] It will be obvious to those having skill in the art that many changes may be made to the details of the above-described embodiment of this invention without departing from the underlying principles thereof. The scope of the present invention should, therefore, be determined only by the following claims.

1. A static error checking system for analyzing a software system inorder to detect design errors prior to system execution, the softwaresystem comprising a set of software elements which expose controlinteractions between the software elements, by representing the controlinteractions in a control graph, the control graph comprising: a set ofconjunctive nodes, each of which represents a conjunctive boolean guardon state changes within the software system; a set of disjunctive nodes,each of which represents a boolean guard on a functional object withinone of the software elements; a set of action nodes, each of whichrepresents a functional object within one of the software elements thatresponds to control interactions and produces control interactions; anda set of directed edges, each of which connect two nodes and representsimplication between the two nodes.
 2. A static error checking systemaccording to claim 1 wherein each edge of the set of directed edges hasan origin and a destination and only responds to a true value at theorigin.
 3. A static error checking system according to claim 1 whereineach edge of the set of directed edges has an origin and a destinationand only responds to a false value at the origin.
 4. A static errorchecking system according to claim 1 wherein each edge of the set ofdirected edges has an origin and a destination and only asserts a falsevalue at the destination.
 5. A static error checking system according toclaim 1 wherein each edge of the set of directed edges has an origin anda destination and only asserts a true value at the destination.
 6. Adata structure for representing control constraints and control actionsof a software system, the software system comprising at least twosoftware elements with explicit control interactions between thesoftware elements, the data structure comprising: a set of conjunctiveboolean guards on state changes within the software system; a set ofboolean guards on functional objects within the software elements; a setof functional control objects within the software elements, eachfunctional control object being responsive to a control interaction andcapable of producing a control interaction; and a set of relationalconnections between two elements from the set of conjunctive booleanguards, the set of boolean guards on objects, and the set of functionalcontrol objects, each pointer representing implication between the twoelements it connects.
 7. A data structure according to claim 6 wherein each edge of the set of directed edges has an origin and a destination and only responds to a true value at the origin.
 8. A data structure according to claim 6 wherein each edge of the set of directed edges has an origin and a destination and only responds to a false value at the origin.
 9. A data structure according to claim 6 wherein each edge of the set of directed edges has an origin and a destination and only asserts a false value at the destination.
 10. A data structure according to claim 6 wherein each edge of the set of directed edges has an origin and a destination and only asserts a true value at the destination.
 11. A static error checking system for debugging a software system, the software system comprising software elements which expose control interactions and data flow interactions between the software elements, by representing the control interactions and the data flow interactions in a graph, the graph comprising: a set of conjunctive nodes, each of which represents a conjunctive boolean guard on state changes within the software system; a set of disjunctive nodes, each of which represents a boolean guard on a functional object within one of the software elements; a set of action nodes, each of which represents a functional object within one of the software elements that is responsive to a control interaction and capable of producing a control interaction; a set of data flow nodes, each of which represents a data flow interaction between a first and a second software element; and a set of directed edges, each of which connects a first node to a second node and represents implication between the first and second nodes.
 12. A static error checking system according to claim 11 wherein each edge of the set of directed edges has an origin and a destination and only responds to a true value at the origin.
 13. A static error checking system according to claim 11 wherein each edge of the set of directed edges has an origin and a destination and only responds to a false value at the origin.
 14. A static error checking system according to claim 11 wherein each edge of the set of directed edges has an origin and a destination and only asserts a false value at the destination.
 15. A static error checking system according to claim 11 wherein each edge of the set of directed edges has an origin and a destination and only asserts a true value at the destination.
 16. A data structure for representing control constraints and control actions of a software system, the software system comprising at least two software elements with explicit control interactions between the software elements, the data structure comprising: a set of conjunctive boolean guards on state changes within the software system; a set of boolean guards on functional objects within the software elements; a set of functional control objects within the software elements, each functional control object being responsive to a control interaction and capable of producing a control interaction; a set of data flow nodes, each of which represents a data flow interaction between a first and a second software element; and a set of relational connections between two elements from the set of conjunctive boolean guards, the set of boolean guards on objects, and the set of functional control objects, each pointer representing implication between the two elements it connects.
 17. A data structure according to claim 16 wherein each edge of the set of directed edges has an origin and a destination and only responds to a true value at the origin.
 18. A data structure according to claim 16 wherein each edge of the set of directed edges has an origin and a destination and only responds to a false value at the origin.
 19. A data structure according to claim 16 wherein each edge of the set of directed edges has an origin and a destination and only asserts a false value at the destination.
 20. A data structure according to claim 16 wherein each edge of the set of directed edges has an origin and a destination and only asserts a true value at the destination.
 21. A method for converting a control graph representation of a software system, having a state space and an initial state, into a binary decision diagram of the software system comprising: transforming the control graph to express a potential next state of the software system after a predetermined period of time; and generating a binary decision diagram based on the transformed control graph, whereby known static error checking techniques may be used to further identify any unexpected behavior of the software system without incurring the cost of fully elaborating the state space of the software system.
 22. A method according to claim 21 wherein transforming the control graph comprises unrolling the control graph.
 23. A method according to claim 22 wherein generating the binary decision diagram comprises using the apply algorithm on a characteristic function of the unrolled control graph.
 24. A method according to claim 23 wherein unrolling the control graph comprises: creating a copy of each disjunctive node, each disjunctive node representing a boolean guard on a functional object within one of the software elements; creating a copy of each conjunctive node, each conjunctive node representing a conjunctive boolean guard on state changes within the software system; creating a copy of each action node, each action node representing a functional object within one of the software elements that is responsive to a control interaction and capable of producing a control interaction, if the functional object it represents performs a predetermined function without a predetermined delay; for each delayed action node which represents a functional object within one of the software elements that has a predetermined delay in responding to, or producing, a control interaction, creating a sensing edge to connect the delayed action node to a corresponding node in the control graph representing the initial state of the system and creating an outgoing edge to connect the corresponding node, in the control graph representing the initial state of the system, to a corresponding next node, which represents the potential next state of the system; for each outgoing edge that is also an event edge, connecting the outgoing edge to a created event disjunctive node, which represents an event generated by the corresponding node in the control graph representing the initial state of the system; for each created event disjunctive node, creating an edge from the created event disjunctive node to an event conjunctive node; and for each event conjunctive node, creating an edge from the node that generated the event to the event conjunctive node, and creating an edge from the event conjunctive node to the copy of the node that generated the event.
 25. A bit vector for use in debugging a software system, the software system comprising at least a first and second component and a coordinator for implementing a predetermined coordination scheme for managing control and dataflow interactions between the first and second components, the first and second components connected to the coordinator by a first and second pair of complementary coordination interfaces, respectively, each coordination interface in the first pair of complementary coordination interfaces comprising a control port for transferring control state between the respective component and the coordinator, each control port having a control state value representing on or off, the bit vector comprising: one bit corresponding to each control port within the software system, each bit having a boolean value representing the control state value of its corresponding control port at a predetermined time.
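
By way of illustration only, and without limiting the claims, the bit vector of claim 25 might be sketched as follows. This is a minimal sketch under stated assumptions: the class name ControlStateVector, its method names, and the use of string identifiers for control ports are hypothetical and do not appear in the disclosure. The only property taken from claim 25 is that one bit corresponds to each control port and holds a boolean value for that port's on or off control state at a given time.

```python
from typing import Dict, Iterable, List


class ControlStateVector:
    """Hypothetical bit vector: one bit per control port (1 = on, 0 = off)."""

    def __init__(self, port_names: Iterable[str]) -> None:
        names = list(port_names)
        if len(set(names)) != len(names):
            raise ValueError("control port names must be unique")
        # Assumption: each control port is identified by a unique string.
        # The position of a name in the list fixes the position of its bit.
        self._index: Dict[str, int] = {name: i for i, name in enumerate(names)}
        self._bits = 0  # every port starts in the "off" state

    def set(self, port: str, on: bool) -> None:
        """Record the control state value of one port."""
        mask = 1 << self._index[port]
        self._bits = (self._bits | mask) if on else (self._bits & ~mask)

    def get(self, port: str) -> bool:
        """Return True if the port's bit is set (state "on")."""
        return bool((self._bits >> self._index[port]) & 1)

    def snapshot(self) -> int:
        """Return the whole vector as an integer, for comparing states."""
        return self._bits

    def as_bits(self) -> List[bool]:
        """Return the vector as a list of booleans in port order."""
        return [bool((self._bits >> i) & 1) for i in range(len(self._index))]
```

Packing the bits into a single integer makes two system states comparable with one equality test, which is one plausible reason to prefer a bit vector over a per-port record; other encodings would satisfy the same description.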